Abstract

We present algorithms for topic modeling based on the geometry of
cross-document word-frequency patterns. This perspective gains
significance under the so-called separability condition, which requires
the existence of novel words that are unique to each topic. We present a
suite of highly efficient algorithms based on data-dependent and random
projections of word-frequency patterns to identify novel words and their
associated topics. We also discuss the statistical guarantees of the
data-dependent projections method, which rest on two mild assumptions on
the prior density of the topic-document matrix. Our key insight is that
the maximum and minimum values of cross-document frequency patterns
projected along any direction are associated with novel words. While our
sample complexity bounds for topic recovery are similar to the state of
the art, the computational complexity of our random projection scheme
scales linearly with the number of documents and the number of words per
document. We present several experiments on synthetic and real-world
datasets to demonstrate the qualitative and quantitative merits of our
scheme.



1. Introduction 

We consider a corpus of $M$ documents composed of words chosen from a
vocabulary of $W$ distinct words indexed by $w = 1, \ldots, W$. We adopt
the classic "bag of words" modeling paradigm widely used in probabilistic
topic modeling (Blei, 2012). Each document is modeled as being generated
by $N$ independent and identically distributed (iid) drawings of words
from an unknown $W \times 1$ document word-distribution vector. Each
document word-distribution vector is itself modeled as an unknown
probabilistic mixture of $K < \min(M, W)$ unknown $W \times 1$ latent
topic word-distribution vectors that are shared among the $M$ documents
in the corpus. Documents are generated independently. For future
reference, we adopt the following notation. We denote by $\beta$ the
unknown $W \times K$ topic matrix whose columns are the $K$ latent topic
word-distribution vectors. $\theta$ denotes the $K \times M$ weight
matrix whose $M$ columns are the mixing weights over $K$ topics for the
$M$ documents. These columns are assumed to be iid samples from a prior
distribution. Each column of the $W \times M$ matrix $A = \beta\theta$
corresponds to a document word-distribution vector. $X$ denotes the
observed $W \times M$ word-by-document matrix realization. The $M$
columns of $X$ are the empirical word-frequency vectors of the $M$
documents. Our goal is to estimate the latent topic word-distribution
vectors ($\beta$) from the empirical word-frequency vectors of all
documents ($X$).

A fundamental challenge here is that the word-by-document distributions
($A$) are unknown and only a realization is available through the sampled
word frequencies in each document. Another challenge is that even when
these distributions are exactly known, the decomposition into the product
of the topic matrix, $\beta$, and the topic-document distributions,
$\theta$, which is known as Nonnegative Matrix Factorization (NMF), has
been shown to be an NP-hard problem in general. In this paper, we develop
computationally efficient algorithms with provable guarantees for
estimating $\beta$ for topic matrices satisfying the separability
condition (Donoho & Stodden, 2004; Arora et al., 2012b).

Definition 1. (Separability) A topic matrix $\beta \in \mathbb{R}^{W \times K}$
is separable if for each topic $k$, there is some word $i$ such that
$\beta_{i,k} > 0$ and $\beta_{i,l} = 0$, $\forall l \neq k$.
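As a small illustration of Definition 1, the following sketch checks separability of a topic matrix given as a list of rows. The function names `novel_words` and `is_separable` and the toy matrices are illustrative assumptions and do not come from the paper.

```python
def novel_words(beta):
    """Map each topic k to the words whose row of beta is supported on k only."""
    num_topics = len(beta[0])
    novel = {k: [] for k in range(num_topics)}
    for i, row in enumerate(beta):
        support = [k for k, value in enumerate(row) if value > 0]
        if len(support) == 1:  # word i occurs in exactly one topic
            novel[support[0]].append(i)
    return novel


def is_separable(beta):
    """Definition 1: every topic has at least one novel word."""
    return all(len(words) > 0 for words in novel_words(beta).values())


# Toy W x K matrices (hypothetical values); columns sum to one.
beta = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]   # words 0 and 1 are novel
mixed = [[0.5, 0.5], [0.5, 0.5]]              # no word is unique to a topic
```

Here `novel_words(beta)` groups words 0 and 1 under topics 0 and 1 respectively, while `mixed` has no novel words and is therefore not separable.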

The condition requires the existence of novel words that are unique to
each topic. Our algorithm has three main steps. In the first step, we
identify novel words by means of data-dependent or random projections. A
key insight here is that when each word is associated with a vector
consisting of its occurrences across all documents, the novel words
correspond to extreme points of the convex hull of these vectors. Our
idea is that whenever a convex object is projected along a random
direction, the maximum and minimum values in the projected direction
correspond to extreme points of the convex object. While our method
identifies novel words with negligible false and missed detections,
multiple novel words associated with the same topic can be an issue. To
account for this, in the second step we apply a distance-based clustering
algorithm to cluster novel words belonging to the same topic. Our final
step uses linear regression to estimate topic word frequencies from the
novel words.

We show that our scheme has a sample complexity similar to that of the
state of the art, such as (Arora et al., 2012a). On the other hand, the
computational complexity of our scheme can scale as low as
$O(\sqrt{M}W + MN)$ for a corpus containing $M$ documents, with an
average of $N$ words per document from a vocabulary containing $W$ words.
We then present a set of experiments on synthetic and real-world
datasets. The results demonstrate the qualitative and quantitative
superiority of our scheme in comparison to other state-of-the-art
schemes.

2. Related Work 

The literature on topic modeling and discovery is extensive. One
direction of work is based on solving a nonnegative matrix factorization
(NMF) problem. To address the scenario where only the realization $X$ is
known and not $A$, several papers (Lee & Seung, 1999; Donoho & Stodden,
2004; Cichocki et al., 2009; Recht et al., 2012) attempt to minimize a
regularized cost function. Nevertheless, this joint optimization is
non-convex and suboptimal strategies have been used in this context.
Unfortunately, when $N \ll W$, which is often the case, many words do not
appear in $X$ and such methods often fail.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003; Blei, 2012) is a
statistical approach to topic modeling. In this approach, the columns of
$\theta$ are modeled as iid random drawings from some prior distribution
such as the Dirichlet. The goal is to compute MAP (maximum a posteriori
probability) estimates for the topic matrix. This setup is inherently
non-convex, and MAP estimates are computed using variational Bayes
approximations of the posterior distribution, Gibbs sampling or
expectation propagation.



A number of methods with provable guarantees have also been proposed.
(Anandkumar et al., 2012) describe a novel method-of-moments approach.
While their algorithm does not impose structural assumptions on the topic
matrix $\beta$, they require Dirichlet priors for the $\theta$ matrix.
One issue is that such priors do not permit certain classes of correlated
topics (Blei & Lafferty, 2007; Li & McCallum, 2007). Also, their
algorithm is not agnostic since it uses parameters of the Dirichlet
prior. Furthermore, the suggested algorithm involves finding empirical
moments and singular value decompositions, which can be cumbersome for
large matrices.

Our work is closely related to the recent work of (Arora et al., 2012b)
and (Arora et al., 2012a), with some important differences. In their
work, they describe methods with provable guarantees when the topic
matrix satisfies the separability condition. Their algorithm discovers
novel words from empirical word co-occurrence patterns, and then in a
second step the topic matrix is estimated. Their key insight is that
when each word $j$ is associated with a $W$-dimensional vector (see
footnote 1), the novel words correspond to extreme points of the convex
hull of these vectors. (Arora et al., 2012a) present combinatorial
algorithms to recover novel words with computational complexity scaling
as $O(MN^2 + W^2 + WK/\epsilon^2)$, where $\epsilon$ is the element-wise
tolerable error of the topic matrix $\beta$. An important computational
remark is that $\epsilon$ often scales with $W$, i.e., probability values
in $\beta$ get small when $W$ is increased, hence one needs a smaller
$\epsilon$ to safely estimate $\beta$ when $W$ is large. Another issue
with their method is that empirical estimates of joint probabilities in
the word-word co-occurrence matrix can be unreliable, especially when $M$
is not large enough. Finally, their novel word detection algorithm
requires linear independence of the extreme points of the convex hull.
This can be a serious problem in some datasets where word co-occurrences
lie on a low-dimensional manifold.

Major Differences: Our work also assumes separability and the existence
of novel words. We associate each word with an $M$-dimensional vector
consisting of the word's frequency of occurrence in the $M$ documents,
rather than word co-occurrences as in (Arora et al., 2012b;a). We also
show that extreme points of the convex hull of these cross-document
frequency patterns are associated with novel words. While these
differences appear technical, they have important consequences. In
several experiments our approach appears to significantly outperform
(Arora et al., 2012a) and mirror the performance of more conventional
methods such as LDA (Griffiths & Steyvers, 2004). Furthermore, our
approach can deal with degenerate cases found in some image datasets
where the data vectors can lie on a manifold of lower dimension than the
number of topics. At a conceptual level our approach appears to hinge on
distinct cross-document support patterns of novel words belonging to
different topics. When support patterns are distinct, this is typically
more robust to sampling fluctuations than the word co-occurrence
statistics of the corpus. Our approach also differs algorithmically. We
develop novel algorithms based on data-dependent and random projections
to find extreme points efficiently, with computational complexity scaling
as $O(MN + \sqrt{M}W)$ for the random scheme.

Footnote 1: The $k$-th component is the probability of occurrence of word
$j$ and word $k$ in the same document in the entire corpus.

Organization: We illustrate the motivating topic geometry in Section 3.
We then present our three-step algorithm in Section 4, with intuitions
and computational complexity. The statistical correctness of each step of
the proposed approach is summarized in Section 5. We address practical
issues in Section 6.

3. Topic Geometry 

Recall that $X$ and $A$ respectively denote the $W \times M$ empirical
and actual document word-distribution matrices, and $A = \beta\theta$,
where $\beta$ is the latent topic word-distribution matrix and $\theta$
is the underlying weight matrix. Let $\bar{A}$, $\bar{\theta}$ and
$\bar{X}$ denote the $A$, $\theta$ and $X$ matrices after $\ell_1$ row
normalization. We set
$\bar{\beta} = \mathrm{diag}(A\mathbf{1})^{-1} \beta \, \mathrm{diag}(\theta\mathbf{1})$,
so that $\bar{A} = \bar{\beta}\bar{\theta}$. Let $\bar{X}_i$ and
$\bar{A}_i$ respectively denote the $i$-th row of $\bar{X}$ and
$\bar{A}$, representing the cross-document patterns of word $i$. We
assume that $\beta$ is separable (Def. 1). Let $\mathcal{C}_k$ be the set
of novel words of topic $k$ and let $\mathcal{C}_0$ be the set of
non-novel words.

The geometric intuition underlying our approach is formulated in the
following proposition:

Proposition 1. Let $\beta$ be separable. Then for all novel words
$i \in \mathcal{C}_j$, $\bar{A}_i = \bar{\theta}_j$, and for all
non-novel words $i \in \mathcal{C}_0$, $\bar{A}_i$ is a convex
combination of the $\bar{\theta}_j$'s, for $j = 1, \ldots, K$.

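Proof: Note that for all $i$,

$$\sum_{k=1}^{K} \bar{\beta}_{ik} = 1,$$

and for all $i \in \mathcal{C}_j$, $\bar{\beta}_{ij} = 1$. Moreover, we
have

$$\bar{A}_i = \sum_{k=1}^{K} \bar{\beta}_{ik} \bar{\theta}_k .$$

Hence $\bar{A}_i = \bar{\theta}_j$ for $i \in \mathcal{C}_j$. In
addition, $\bar{A}_i = \sum_{k=1}^{K} \bar{\beta}_{ik} \bar{\theta}_k$
is a convex combination of the $\bar{\theta}_k$'s for
$i \in \mathcal{C}_0$. ■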

Fig. 1 illustrates this geometry. Without loss of generality, we can
assume that the novel word vectors $\bar{\theta}_i$ are not in the convex
hull of the other rows of $\bar{\theta}$. Hence, the problem of
identifying novel words reduces to finding the extreme points of all the
$\bar{A}_i$'s.
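The statement of Proposition 1 can be checked numerically on a toy example. The sketch below is only an illustration under assumed values: the matrices `beta` and `theta` and the helper names are hypothetical, and the noise-free matrix $A = \beta\theta$ is used in place of the observed $X$.

```python
def row_normalize(matrix):
    """Divide every row by its l1 norm (all entries are non-negative here)."""
    return [[value / sum(row) for value in row] for row in matrix]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def close(u, v, tol=1e-12):
    return all(abs(x - y) <= tol for x, y in zip(u, v))


# Hypothetical toy model: W = 3 words, K = 2 topics, M = 3 documents.
beta = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]    # words 0 and 1 are novel
theta = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]     # columns are mixing weights

A = matmul(beta, theta)
A_bar = row_normalize(A)
theta_bar = row_normalize(theta)

# beta_bar = diag(A 1)^{-1} beta diag(theta 1); its rows sum to one.
theta_mass = [sum(row) for row in theta]
beta_bar = [[beta[i][k] * theta_mass[k] / sum(A[i]) for k in range(len(theta))]
            for i in range(len(beta))]
```

In this example the normalized rows of the two novel words coincide with the rows of `theta_bar`, and the non-novel word 2 is their midpoint, since its row of `beta_bar` is (0.5, 0.5).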




Figure 1. A separable topic matrix and the underlying geometric
structure. Solid circles represent rows of $\bar{A}$, empty circles
represent rows of $\bar{X}$. Projections of the $\bar{X}_i$'s along a
direction $d$ can be used to identify novel words.

Furthermore, retrieving the topic matrix $\beta$ is straightforward given
all $K$ distinct novel words:

Proposition 2. If the matrix $A$ and $K$ distinct novel words
$\{i_1, \ldots, i_K\}$ are given, then $\beta$ can be calculated using
$W$ linear regressions.

Proof: By Proposition 1, we have
$\bar{\theta} = (\bar{A}_{i_1}^{\top}, \ldots, \bar{A}_{i_K}^{\top})^{\top}$.
Next, $\bar{A}_i = \bar{\beta}_i \bar{\theta}$, so $\bar{\beta}_i$ can be
computed by solving a linear system of equations. Specifically, if we let
$\beta' = \mathrm{diag}(A\mathbf{1}) \bar{\beta} = \beta \, \mathrm{diag}(\theta\mathbf{1})$,
then $\beta$ can be obtained by column normalizing $\beta'$. ■

Propositions 1 and 2 validate the approach of estimating $\beta$ by
identifying novel words, given access to $A$. However, in the real
problem only $X$, a realization of $A$, is available, and $X$ is not
close to $A$ in typical settings of interest ($N \ll W$). Nevertheless,
even when the number of samples per document ($N$) is limited, if we
collect enough documents ($M \to \infty$), the proposed algorithm can
still asymptotically estimate $\beta$ with arbitrary precision, as we
discuss in the following sections.

4. Proposed Algorithm 

The geometric intuition mentioned in Propositions 1 
and 2 motivates the following three-step approach for 
topic discovery : 



Topic Discovery through Data Dependent and Random Projections 



(1) Novel Word Detection: Given the empirical word-by-document matrix
$X$, extract the set of all novel words $\mathcal{I}$. We present
variants of projection-based algorithms in Sec. 4.1.

(2) Novel Word Clustering: Given a set of novel words $\mathcal{I}$ with
$|\mathcal{I}| \ge K$, cluster them into $K$ groups corresponding to the
$K$ topics. Pick a representative for each group. We adopt a
distance-based clustering algorithm (Sec. 4.2).

(3) Topic Estimation: Estimate the topic matrix as suggested in
Proposition 2 by constrained linear regression (Sec. 4.3).

4.1. Novel Word Detection 

Fig. 1 illustrates the key insight for identifying novel words as extreme
points of a convex body. When we project every point of a convex body
onto some direction $d$, the maximum and minimum correspond to extreme
points of the convex body. Our proposed approaches, data-dependent and
random projections, both exploit this fact. They differ only in the
choice of projection directions.

A. Data Dependent Projections (DDP) 

To simplify our analysis, we randomly split each document into two
subsets and obtain two statistically independent document collections $X$
and $X'$, both distributed as $A$, and then row normalize them as
$\bar{X}$ and $\bar{X}'$. For some threshold $d$, to be specified later,
and for each word $i$, we consider the set $J_i$ of all other words that
are sufficiently different from word $i$ in the following sense:

$$J_i = \{\, j \mid M (\bar{X}_i - \bar{X}_j)(\bar{X}'_i - \bar{X}'_j)^{\top} \ge d/2 \,\} \qquad (1)$$

We then declare word $i$ a novel word if all words $j \in J_i$ are
uniformly uncorrelated with word $i$ with some margin $\gamma/2$, to be
specified later:

$$M \langle \bar{X}_i, \bar{X}'_i \rangle \ge M \langle \bar{X}_i, \bar{X}'_j \rangle + \gamma/2, \quad \forall j \in J_i \qquad (2)$$

The correctness of the DDP algorithm is established by the following
proposition and is discussed further in Section 5. The proof is given in
the Supplementary section.

Proposition 3. Suppose conditions P1 and P2 (defined in Section 5) on the
prior distribution of $\theta$ hold. Then there exist two positive
constants $d$ and $\gamma$ such that if $i$ is a novel word, then for all
$j \in J_i$,
$M \langle \bar{X}_i, \bar{X}'_i \rangle - M \langle \bar{X}_i, \bar{X}'_j \rangle \ge \gamma/2$
with high probability (converging to one as $M \to \infty$). In addition,
if $i$ is a non-novel word, there exists some $j \in J_i$ such that
$M \langle \bar{X}_i, \bar{X}'_i \rangle - M \langle \bar{X}_i, \bar{X}'_j \rangle \le \gamma/2$
with high probability.



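Algorithm 1 Novel Word Detection - DDP

 1: Input: $\bar{X}$, $\bar{X}'$, $d$, $\gamma$
 2: Output: The indices of the novel words $\mathcal{I}$
 3: $C \leftarrow M \bar{X} \bar{X}'^{\top}$
 4: $\mathcal{I} \leftarrow \emptyset$
 5: for all $1 \le i \le W$ do
 6:   $J_i \leftarrow$ all indices $j \neq i$ : $C_{i,i} - 2C_{i,j} + C_{j,j} \ge d/2$
 7:   if $\forall j \in J_i$ : $C_{i,i} - C_{i,j} \ge \gamma/2$ then
 8:     $\mathcal{I} \leftarrow \mathcal{I} \cup \{i\}$
 9:   end if
10: end for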



The procedure is detailed in Algorithm 1. The running time of the
algorithm is summarized in the following proposition. A detailed
justification is provided in the Supplementary section.

Proposition 4. The running time of Algorithm 1 is $O(MN^2 + W^2)$.

Proof Sketch. Note that $X$ is sparse since $N \ll W$. Hence, by
exploiting the sparsity, $C = M \bar{X} \bar{X}'^{\top}$ can be computed
in $O(MN^2 + W)$ time. Over all words $i$, finding $J_i$ and checking
$C_{i,i} - C_{i,j} \ge \gamma/2$ costs $O(W^2)$ time in the worst case. ■
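The two tests in Algorithm 1 can be sketched in a few lines of Python. This is a minimal dense illustration, not the sparse implementation analyzed in Proposition 4; the function name `ddp_novel_words`, the toy matrix and the values of `d` and `gamma` are assumptions made for the example, and the same noise-free matrix is passed for both halves $X$ and $X'$.

```python
def row_normalize(matrix):
    """l1-normalize each row; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized


def ddp_novel_words(X, X_prime, d, gamma):
    """Dense sketch of Algorithm 1 (DDP) on W x M lists of rows."""
    num_docs = len(X[0])
    Xb, Xpb = row_normalize(X), row_normalize(X_prime)
    W = len(Xb)
    # C = M * X_bar * X_bar'^T, computed entry by entry.
    C = [[num_docs * sum(x * y for x, y in zip(Xb[i], Xpb[j]))
          for j in range(W)] for i in range(W)]
    novel = []
    for i in range(W):
        # J_i: words that are sufficiently far from word i, as in Eq. (1).
        J_i = [j for j in range(W)
               if j != i and C[i][i] - 2 * C[i][j] + C[j][j] >= d / 2]
        # Word i is declared novel if the margin in Eq. (2) holds on all of J_i.
        if all(C[i][i] - C[i][j] >= gamma / 2 for j in J_i):
            novel.append(i)
    return novel


# Noise-free toy corpus: document 0 is pure topic 0, document 1 is pure topic 1.
X = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
detected = ddp_novel_words(X, X, d=1.0, gamma=1.0)
```

On this toy input the margins $C_{i,i} - C_{i,j}$ are 2 and 1 for the two novel words and 0 for the mixed word, so only words 0 and 1 are returned. As in Algorithm 1, a word with an empty $J_i$ would also be declared novel.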

B. Random Projections (RP) 

DDP uses $W$ different directions to find all the extreme points. Here we
use random directions instead. This significantly reduces the time
complexity by decreasing the number of required projections.

The Random Projection algorithm (RP) uses roughly $P = O(K)$ random
directions drawn iid uniformly over the unit sphere. For each direction
$d$, we project all the $\bar{X}_i$'s onto it and choose the maximum and
the minimum. Note that $\bar{X}_i d$ will converge to $\bar{A}_i d$
conditioned on $d$


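Algorithm 2 Novel Word Detection - RP

1: Input: $\bar{X}$, $P$
2: Output: The indices of the novel words $\mathcal{I}$
3: $\mathcal{I} \leftarrow \emptyset$
4: for all $1 \le j \le P$ do
5:   Generate $d \sim$ Uniform(unit sphere in $\mathbb{R}^M$)
6:   $i_{\max} = \arg\max_i \bar{X}_i d$, $i_{\min} = \arg\min_i \bar{X}_i d$
7:   $\mathcal{I} \leftarrow \mathcal{I} \cup \{i_{\max}, i_{\min}\}$
8: end for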



and $\theta$ as $M$ increases. Moreover, $\bar{A}_i d$ can be the maximum
or minimum projection value only for the extreme points $i$. This
provides the intuition for the consistency of RP. Since the directions
are independent, we expect to find all the novel words using $P = O(K)$
random projections.
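A minimal sketch of the random projection step is given below. It assumes that a uniform direction on the unit sphere is obtained by normalizing a vector of independent standard Gaussians; the function name `rp_novel_words`, the seed and the toy matrix are illustrative choices rather than part of the paper.

```python
import random


def row_normalize(matrix):
    """l1-normalize each row; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized


def rp_novel_words(X, num_projections, seed=0):
    """Sketch of Algorithm 2: keep the argmax and argmin of each projection."""
    rng = random.Random(seed)
    Xb = row_normalize(X)
    W, M = len(Xb), len(Xb[0])
    found = set()
    for _ in range(num_projections):
        # Normalized Gaussian vector: uniformly distributed on the unit sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(M)]
        norm = sum(v * v for v in g) ** 0.5
        direction = [v / norm for v in g]
        proj = [sum(x * y for x, y in zip(row, direction)) for row in Xb]
        found.add(max(range(W), key=proj.__getitem__))
        found.add(min(range(W), key=proj.__getitem__))
    return found


# Noise-free toy corpus: words 0 and 1 are novel, word 2 is their midpoint.
X = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
found = rp_novel_words(X, num_projections=5)
```

Because the row of word 2 is the midpoint of the other two rows, its projection always lies between theirs, so it is never selected as a maximum or minimum.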
C. Random Projections with Binning

An alternative to RP is a Binning algorithm, which is computationally
more efficient. Here the corpus is split into $\sqrt{M}$ equal-sized
bins. For each bin $j$ a random direction $d^{(j)}$ is chosen, and the
word with the maximum projection along $d^{(j)}$ is declared the winner.
Then, we count the number of wins for each word $i$. We divide these
winning frequencies by $\sqrt{M}$ to obtain an estimate of
$p_i = \Pr(\forall j \neq i : \bar{A}_i d \ge \bar{A}_j d)$. $p_i$ can be
shown to be zero for all non-novel words. For a non-degenerate prior over
$\theta$, these probabilities converge to strictly positive values for
novel words. Hence, estimating the $p_i$'s helps in identifying novel
words. We then choose the indices of the $O(K)$ largest $p_i$ values as
novel words. The Binning algorithm is outlined in Algorithm 3.



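Algorithm 3 Novel Word Detection - Binning

 1: Input: $X$, $X'$, $d$, $K$
 2: Output: The indices of the novel words $\mathcal{I}$
 3: Split the documents in $X$ into $\sqrt{M}$ equal-sized groups of
    documents $X^{(1)}, \ldots, X^{(\sqrt{M})}$ and normalize each one
    separately to obtain $\bar{X}^{(1)}, \ldots, \bar{X}^{(\sqrt{M})}$ as well.
 4: for all $1 \le j \le \sqrt{M}$ do
 5:   $d^{(j)} \leftarrow$ a sample from $U(S^{\sqrt{M}-1})$
 6:   $l \leftarrow \arg\max_{1 \le i \le W} \bar{X}^{(j)}_i d^{(j)}$
 7:   $\hat{p}_l \leftarrow \hat{p}_l + 1$
 8: end for
 9: for all $1 \le i \le W$ do
10:   $\hat{p}_i \leftarrow \hat{p}_i / \sqrt{M}$
11: end for
12: $k \leftarrow 0$, $\mathcal{I} \leftarrow \emptyset$ and $i \leftarrow 1$
13: repeat
14:   $j \leftarrow$ the index of the $i$-th largest value of $(\hat{p}_1, \ldots, \hat{p}_W)$
15:   if $\mathcal{I} = \emptyset$ or $\forall l \in \mathcal{I}$ : $M(\bar{X}_j - \bar{X}_l)(\bar{X}'_j - \bar{X}'_l)^{\top} \ge d/2$ then
16:     $\mathcal{I} \leftarrow \mathcal{I} \cup \{j\}$
17:     $k \leftarrow k + 1$
18:   end if
19:   $i \leftarrow i + 1$
20: until $k = K$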

In contrast with DDP, the RP algorithm is completely agnostic and
parameter-free: it requires no parameters such as $d$ and $\gamma$ to
find the novel words. Moreover, it significantly reduces the
computational complexity:

Proposition 5. The running times of the RP and Binning algorithms are
$O(MNK + WK)$ and $O(MN + \sqrt{M}W)$, respectively.

Proof. We sketch the proof here and provide a more detailed justification
in the Supplementary section. Note that the number of operations needed
to find the projections is $O(MN + W)$ in Binning and $O(MNK + W)$ in RP.
In addition, finding the maximum takes $O(WK)$ for RP and $O(\sqrt{M}W)$
for Binning. In sum, it takes $O(MNK + WK)$ for RP and
$O(MN + \sqrt{M}W)$ for Binning to find all the novel words. □

4.2. Novel Word Clustering 

Since there may be multiple novel words for a single topic, our DDP or RP
algorithm can extract multiple novel words for each topic. This
necessitates clustering to group the copies. We can show that our
clustering scheme is consistent if we assume that
$R = \frac{1}{M} E(\theta\theta^{\top})$ is positive definite:

Proposition 6. Let $C_{i,j} = M \bar{X}_i \bar{X}'^{\top}_j$ and
$D_{i,j} = C_{i,i} - 2C_{i,j} + C_{j,j}$. If $R$ is positive definite,
then $D_{i,j}$ converges to zero in probability as $M \to \infty$
whenever $i$ and $j$ are novel words of the same topic. Moreover, if $i$
and $j$ are novel words of different topics, it converges in probability
to some strictly positive value greater than some constant $d$.

The proof is presented in the Supplementary section.
As Proposition 6 suggests, we construct a binary graph whose vertices
correspond to the novel words. An edge between words $i$ and $j$ is
established if $D_{i,j} \le d/2$. Then, the clustering reduces to finding
$K$ connected components. The procedure is described in Algorithm 4.

Algorithm 4 Novel Word Clustering

 1: Input: $\mathcal{I}$, $\bar{X}$, $\bar{X}'$, $d$, $K$
 2: Output: $\mathcal{J}$, which is a set of $K$ novel words of distinct topics
 3: $C \leftarrow M \bar{X} \bar{X}'^{\top}$
 4: $B \leftarrow$ a $|\mathcal{I}| \times |\mathcal{I}|$ zero matrix
 5: for all $i, j \in \mathcal{I}$, $i \neq j$ do
 6:   if $C_{i,i} - 2C_{i,j} + C_{j,j} \le d/2$ then
 7:     $B_{i,j} \leftarrow 1$
 8:   end if
 9: end for
10: $\mathcal{J} \leftarrow \emptyset$
11: for all $1 \le j \le K$ do
12:   $c \leftarrow$ one of the indices of the $j$-th connected component vertices in $B$
13:   $\mathcal{J} \leftarrow \mathcal{J} \cup \{c\}$
14: end for
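The thresholded graph and its connected components can be sketched as follows. This is an illustration under assumptions: the function name `cluster_novel_words`, the depth-first search and the toy rows are not taken from the paper, and the sketch returns every component it finds rather than enforcing exactly $K$ of them.

```python
def cluster_novel_words(novel, X_bar, X_bar_prime, d):
    """Group novel words whose distance D_ij = C_ii - 2 C_ij + C_jj is at most d/2."""
    num_docs = len(X_bar[0])

    def C(i, j):
        # Entry (i, j) of M * X_bar * X_bar'^T.
        return num_docs * sum(x * y for x, y in zip(X_bar[i], X_bar_prime[j]))

    neighbors = {i: set() for i in novel}
    for i in novel:
        for j in novel:
            if i != j and C(i, i) - 2 * C(i, j) + C(j, j) <= d / 2:
                neighbors[i].add(j)
                neighbors[j].add(i)

    components, seen = [], set()
    for start in novel:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components


# Toy normalized rows: words 0 and 2 share a topic, word 1 belongs to another.
X_bar = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
clusters = cluster_novel_words([0, 1, 2], X_bar, X_bar, d=1.0)
representatives = [component[0] for component in clusters]
```

In this example the distance between words 0 and 2 is 0 and the distance from either to word 1 is 4, so thresholding at $d/2 = 0.5$ yields two clusters.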

In Algorithm 4, we simply choose any word of a cluster as the
representative for each topic. This choice is made only to simplify the
theoretical analysis. In practice, we could instead set the
representative to be the average of the data points in each cluster,
which is more resilient to noise.

4.3. Topic Matrix Estimation 

Given $K$ novel words of different topics ($\mathcal{J}$), we can
directly estimate $\beta$ as in Proposition 2. This is described in
Algorithm 5. We note that this part of the algorithm is similar to some
other topic modeling approaches that exploit separability. Consistency of
this step is also validated in (Arora et al., 2012b). In fact, one may
use the convergence of extremum estimators (Amemiya, 1985) to show the
consistency of this step.



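Algorithm 5 Topic Matrix Estimation

1: Input: $\mathcal{J} = \{j_1, \ldots, j_K\}$, $X$, $X'$
2: Output: $\hat{\beta}$, which is the estimate of the $\beta$ matrix
3: $Y = (\bar{X}^{\top}_{j_1}, \ldots, \bar{X}^{\top}_{j_K})^{\top}$, $Y' = (\bar{X}'^{\top}_{j_1}, \ldots, \bar{X}'^{\top}_{j_K})^{\top}$
4: for all $1 \le i \le W$ do
5:   $\hat{\beta}_i \leftarrow (\frac{1}{M} X_i \mathbf{1}) \arg\min_{b \ge 0,\ b\mathbf{1} = 1} M (\bar{X}_i - bY)(\bar{X}'_i - bY')^{\top}$
6: end for
7: column normalize $\hat{\beta}$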

5. Statistical Complexity Analysis 

In this section, we describe the sample complexity bound for each step of
our algorithm. Specifically, we provide guarantees for the DDP algorithm
under some mild assumptions on the distribution over $\theta$. The
analysis of the random projection algorithm is much more involved and
requires elaborate arguments. We omit it in this paper.

We require the following technical assumptions on the correlation matrix
$R$ and the mean vector $a$ of $\theta$:

(P1) $R$ is positive definite with its minimum eigenvalue being lower
bounded by $\lambda_\wedge > 0$. In addition,
$\forall i$, $a_i \ge a_\wedge > 0$.

(P2) There exists a positive value $\zeta$ such that for $i \neq j$,
$R_{i,i}/(a_i a_i) - R_{i,j}/(a_i a_j) \ge \zeta$.

The second condition captures the following intuition: if two novel words
are from different topics, they must appear in a substantial number of
distinct documents. Note that for two novel words $i$ and $j$ of
different topics,
$M \bar{A}_i (\bar{A}_i - \bar{A}_j)^{\top} \xrightarrow{p} R_{i,i}/(a_i a_i) - R_{i,j}/(a_i a_j)$.
Hence, this requirement means that $M(\bar{A}_i - \bar{A}_j)$ should be
fairly distant from the origin, which implies that the number of
documents in which these two words co-occur, with similar probabilities,
should be small. This is a reasonable assumption, since otherwise we
would rather group two related topics into one. In fact, we show in the
Supplementary section (Section A.5) that both conditions hold for the
Dirichlet distribution, which is a traditional choice for the prior
distribution in topic modeling. Moreover, we have tested the validity of
these assumptions numerically for the logistic normal distribution (with
non-degenerate covariance matrices), which is used in Correlated Topic
Modeling (CTM) (Blei & Lafferty, 2007).

5.1. Novel Word Detection Consistency 

In this section, we provide analysis only for the DDP algorithm. The
sample complexity analysis of the randomized projection algorithms is
more involved and is the subject of ongoing research. Suppose P1 and P2
hold. Denote by $\beta_\wedge$ and $\lambda_\wedge$ positive lower bounds
on the non-zero elements of $\beta$ and the minimum eigenvalue of $R$,
respectively. We have:

Theorem 1. For parameter choices $d = \lambda_\wedge \beta_\wedge^2$ and
$\gamma = \zeta a_\wedge \beta_\wedge$, the DDP algorithm is consistent
as $M \to \infty$. Specifically, true novel and non-novel words are
asymptotically declared as novel and non-novel, respectively.
Furthermore, for

where $C_1$ is a constant, Algorithm 1 finds all novel words without any
outlier with probability at least $1 - \delta_1$, where
$\eta = \min_{1 \le i \le W} \beta_i a$.

Proof Sketch. The detailed justification is provided in the Supplementary
section. The main idea of the proof is a sequence of statements:

• Given P1, for a novel word $i$, $J_i$ defined in Algorithm 1 is a
subset of $J_i^*$ asymptotically with high probability, where
$J_i^* = \{ j : \mathrm{supp}(\beta_j) \neq \mathrm{supp}(\beta_i) \}$.
Moreover, $J_i$ is a superset of $J_i^*$ with high probability for a
non-novel word, with $J_i^* = \{ j : |\mathrm{supp}(\beta_j)| = 1 \}$.

• Given P2, for a novel word $i$, $C_{i,i} - C_{i,j}$ converges to a
strictly positive value greater than $\gamma$ for $j \in J_i^*$, and if
$i$ is non-novel, $\exists j \in J_i^*$ such that $C_{i,i} - C_{i,j}$
converges to a non-positive value.

These statements imply Proposition 3, which proves the consistency of the
DDP algorithm. ■

The term $\eta^{-8}$ seems to be the dominating factor in the sample
complexity bound. Basically,
$\eta = \min_{1 \le i \le W} \frac{1}{M} E(X_i \mathbf{1})$ represents
the minimum proportion of documents that a word would appear in. This is
not surprising, as the rate of convergence of
$C_{i,j} = M \langle \bar{X}_i, \bar{X}'_j \rangle$ depends on the values
of $\frac{1}{M} E(X_i \mathbf{1})$ and $\frac{1}{M} E(X_j \mathbf{1})$.
As these values decrease, $C_{i,j}$ converges to a larger value and the
convergence gets slower. From another viewpoint, given that the number of
words per document $N$ is bounded, a large number of documents is needed
to observe all the words sufficiently often for $C_{i,j}$ to converge. It
is remarkable that a similar term $p$ also arises in the sample
complexity bound of (Arora et al., 2012b), where $p$ is the minimum
non-zero element of the diagonal part of $\beta$. Note that although the
sample complexity bound appears to scale logarithmically with $W$, $\eta$
and $p$ typically decrease as $W$ increases.

5.2. Novel Word Clustering Consistency 

We similarly prove the consistency and sample complexity of the novel
word clustering algorithm:

Theorem 2. For $d = \lambda_\wedge \beta_\wedge^2$, given all true novel
words as the input, the clustering algorithm, Algorithm 4
(ClusterNovelWords), asymptotically (as $M \to \infty$) recovers $K$
novel word indices of different types, namely, the supports of the
corresponding $\beta$ rows are different for any two retrieved indices.
Furthermore, if

$$M \ge \frac{C_2 (\log W + \log(1/\delta_2))}{\cdots}$$

then Algorithm 4 clusters all novel words correctly with probability at
least $1 - \delta_2$.



Proof Sketch. We can show that $C_{i,i} - 2C_{i,j} + C_{j,j}$ converges
to a strictly positive value greater than $d$ if $i$ and $j$ are novel
words of different topics. Moreover, it converges to zero if they are
novel words of the same topic. Hence all novel words of the same topic
are connected in the graph with high probability asymptotically.
Moreover, with high probability there is no edge between novel words of
different topics. Therefore, the connected components of the graph
correspond to the true clusters asymptotically. The detailed discussion
of the convergence rate is provided in the Supplementary section. ■

Note that the sample complexity of the clustering is similar to that of
the novel word detection. This means that novel word detection and
distance-based clustering with the proposed algorithms are almost equally
hard.



5.3. Topic Estimation Consistency 

Finally, we show that the topic estimation by regression is also
consistent.

Theorem 3. Suppose that Algorithm 5 outputs $\hat{\beta}$ given the
indices of $K$ distinct novel words. Then
$\hat{\beta} \xrightarrow{p} \beta$. Specifically, if

$$M \ge \frac{C_3 W^4 (\log(W) + \log(K) + \log(1/\delta_3))}{\lambda_\wedge^2 \eta^8 \epsilon^4 a_\wedge^8}$$

then for all $i$ and $j$, $\hat{\beta}_{i,j}$ will be $\epsilon$-close to
$\beta_{i,j}$ with probability at least $1 - \delta_3$, with
$\epsilon < 1$, $C_3$ being a constant, $a_\wedge = \min_i a_i$ and
$\eta = \min_{1 \le i \le W} \beta_i a$.



Proof Sketch. We provide a detailed analysis in the Supplementary
section. To prove the consistency of the regression algorithm, we use a
consistency result for extremum estimators: if $Q_M(\beta)$ is a
stochastic objective function that is minimized at $\hat{\beta}$ under
the constraint $\beta \in \Theta$ (for a compact $\Theta$), and
$Q_M(\beta)$ converges uniformly to $Q(\beta)$, which in turn is
minimized uniquely at $\beta^*$, then
$\hat{\beta} \xrightarrow{p} \beta^*$ (Amemiya, 1985). In our setting, we
may take $Q_M$ to be the objective function in Algorithm 5. Then,

$$Q_M(b) \xrightarrow{p} Q(b) = b D R D b^{\top} - 2 b D R \frac{\beta_i^{\top}}{\beta_i a} + \frac{\beta_i R \beta_i^{\top}}{(\beta_i a)^2},$$

where $D = \mathrm{diag}(a)^{-1}$. Note that if $R$ is positive definite,
$Q$ is uniquely minimized at $b^* = \frac{\beta_i}{\beta_i a} D^{-1}$,
which satisfies the constraints of the optimization. Moreover, $Q_M$
converges to $Q$ uniformly as a result of the Lipschitz continuity of
$Q_M$. Therefore, according to Slutsky's theorem,
$\hat{\beta}_i = (\frac{1}{M} X_i \mathbf{1}) \hat{b}$, with $\hat{b}$
the minimizer of $Q_M$, converges to $\beta_i D^{-1}$, and hence the
column normalization of $\hat{\beta}$ converges to $\beta$. ■

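In sum, consider the approach outlined at the beginning of Section 4
based on the data-dependent projections method, and assume that
$\hat{\beta}$ is the output. Then,

Theorem 4. The output of the topic modeling algorithm, $\hat{\beta}$,
converges in probability to $\beta$ element-wise. To be precise, if

$$M \ge \max\left\{ \frac{C_2' W^4 \log\frac{WK}{\delta}}{\lambda_\wedge^2 \eta^8 \epsilon^4 a_\wedge^8},\ \frac{C_1' \log(\cdots)}{\cdots \min(\lambda_\wedge^2 \beta_\wedge^2, \zeta^2 a_\wedge^2)} \right\}$$

then with probability at least $1 - 3\delta$, for all $i$ and $k$,
$\hat{\beta}_{i,k}$ will be $\epsilon$-close to $\beta_{i,k}$, with
$\epsilon < 1$, $C_1'$ and $C_2'$ being two constants.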

The proof is a combination of Theorems 1, 2 and 3. 
6. Experimental Results

6.1. Practical Considerations 

The DDP algorithm requires two parameters, $\gamma$ and $d$. In practice,
we can apply DDP adaptively and agnostically without knowing them. Note
that $d$ is used for the construction of $J_i$. We can instead construct
$J_i$ by finding the $r < W$ words that are maximally distant from $i$ in
the sense of Eq. 1. To bypass $\gamma$, we can rank the values of
$\min_{j \in J_i} M \langle \bar{X}_i, \bar{X}'_i \rangle - M \langle \bar{X}_i, \bar{X}'_j \rangle$
across all $i$ and declare the words with the topmost $s$ values to be
the novel words.

The clustering algorithm also requires the parameter $d$. Note that $d$
is used only for thresholding a 0-1 weighted graph. In practice, we can
avoid hard thresholding by using
$\exp(-(C_{i,i} - 2C_{i,j} + C_{j,j}))$ as weights for the graph and
applying spectral clustering. Typically the size of $\mathcal{I}$ in
Algorithm 4 is of the same order as $K$. Hence the spectral clustering is
performed on a relatively small graph, which typically adds $O(K^3)$
computational complexity.

Implementation Details: We choose the parameters of DDP and RP in the
following way. For DDP in all datasets except the Donoho image corpus, we
use the agnostic algorithm discussed in Section 6.1 with $r = W/2$.
Moreover, we take $s = 10 \times K$. For the image dataset, we used
$d = 1$ and $\gamma = 3$. For RP, we set the number of projections
$P \approx 50 \times K$ in all datasets to obtain the results.

6.2. Synthetic Dataset 



Figure 2. Error of the estimated topic matrix in $\ell_1$ norm. Upper:
$W = 500$, $\rho = 0.2$, $N = 100$, $K = 5$; Lower: $W = 500$,
$\rho = 0.2$, $M = 500$, $K = 10$. Top and bottom plots depict the error
with a varying number of documents $M$ (for fixed $N$) and a varying
number of words $N$ (for fixed $M$), respectively. RP & DDP show
consistently better performance.



In this section, we validate our algorithm on synthetic
examples. We generate a $W \times K$ separable topic matrix
$\beta$ with $W_1/K > 1$ novel words per topic as follows:
first, iid $1 \times K$ row-vectors corresponding to non-novel
words are generated uniformly on the probability
simplex. Then, $W_1$ iid Uniform[0, 1] values are generated
for the nonzero entries in the rows of novel words.
The resulting matrix is then column-normalized to get
one realization of $\beta$. Let $p := W_1/W$. Next, $M$ iid
$K \times 1$ column-vectors are generated for the $\theta$ matrix
according to a Dirichlet prior $c \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$. Following
(Griffiths & Steyvers, 2004), we set $\alpha_i = 0.1$ for all $i$.
Finally, we obtain $X$ by generating $N$ iid words for
each document.
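The generation procedure above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' code: the function names (`dirichlet`, `make_topic_matrix`, `make_corpus`) are hypothetical, the novel words are assumed to occupy the first $W_1$ rows and to be assigned to topics in round-robin order, and the uniform values are drawn from (0, 1] rather than [0, 1] so that a novel row never becomes all-zero.

```python
import random

def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        s = sum(g)
        if s > 0.0:  # guard against numerical underflow for very small alphas
            return [x / s for x in g]

def make_topic_matrix(W, K, W1, rng):
    """Return a W x K column-stochastic separable matrix; rows 0..W1-1 are novel words."""
    rows = []
    for w in range(W1):
        row = [0.0] * K
        row[w % K] = 1.0 - rng.random()  # single nonzero entry, value in (0, 1]
        rows.append(row)
    for _ in range(W - W1):
        rows.append(dirichlet([1.0] * K, rng))  # uniform on the probability simplex
    col_sums = [sum(r[k] for r in rows) for k in range(K)]
    return [[r[k] / col_sums[k] for k in range(K)] for r in rows]

def make_corpus(beta, M, N, alpha, rng):
    """Return M documents, each a list of N word indices drawn iid from beta @ theta."""
    W, K = len(beta), len(beta[0])
    docs = []
    for _ in range(M):
        theta = dirichlet([alpha] * K, rng)  # symmetric Dirichlet prior
        word_dist = [sum(beta[w][k] * theta[k] for k in range(K)) for w in range(W)]
        docs.append(rng.choices(range(W), weights=word_dist, k=N))
    return docs
```

For example, `make_corpus(make_topic_matrix(500, 5, 100, rng), 1000, 100, 0.1, rng)` with `rng = random.Random(0)` would correspond to one draw in the $W = 500$, $p = 0.2$, $K = 5$, $N = 100$ setting.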

For different settings of $W$, $p$, $K$, $M$ and $N$, we calculate
the $\ell_1$ distance of the estimated topic matrix to
the ground truth after finding the best matching between
the two sets of topics. For each setting we average
the error over 50 random samples. For RP and DDP
we set the parameters as discussed in the implementation
details.
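The error measure can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `l1_error` is hypothetical, the error is taken to be the sum of column-wise $\ell_1$ distances (the text does not say whether it is summed or averaged over topics), and the best matching is found by brute force over permutations, which is only practical for small $K$. For larger $K$ an assignment solver such as the Hungarian algorithm would be the natural replacement.

```python
from itertools import permutations

def l1_error(beta_true, beta_est):
    """Smallest total l1 distance between topic columns over all column matchings."""
    W, K = len(beta_true), len(beta_true[0])
    # cost[a][b] = l1 distance between true topic a and estimated topic b
    cost = [[sum(abs(beta_true[w][a] - beta_est[w][b]) for w in range(W))
             for b in range(K)] for a in range(K)]
    return min(sum(cost[a][perm[a]] for a in range(K))
               for perm in permutations(range(K)))
```

An estimate that equals the ground truth up to a reordering of its columns has error zero under this measure.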

We compare DDP and RP against the Gibbs sampling
approach (Griffiths & Steyvers, 2004) (Gibbs),
a state-of-the-art NMF-based algorithm (Tan & Fevotte,
in press) (NMF) and the most recent practical provable
algorithm in (Arora et al., 2012a) (RecL2). The
NMF algorithm is chosen because it compensates for
the type of noise in our topic model. Fig. 2 depicts the
estimation error as a function of the number of documents
$M$ (upper) and the number of words per document
$N$ (lower). RP and DDP have similar performance
and are uniformly better than the comparable techniques.
Gibbs performs relatively poorly in the first setting and
NMF in the second. RecL2 performs worse in all the
settings. Note that $M$ is relatively small (< 1,000)
compared to $W = 500$. DDP and RP outperform the other
methods with a fairly small sample size. Meanwhile, as
is also observed in (Arora et al., 2012a), RecL2 performs
poorly with small $M$.

6.3. Swimmer Image Dataset 
Figure 3. (a) Example "clean" images in the Swimmer dataset;
(b) corresponding images with sampling "noise"; (c) examples
of ideal topics.
[Figure 5 image grid. Columns: LA 1-4, RA 1-4, LL 1-4, RL 1-4 (left/right arm and leg positions). Rows: a) to e).]

Figure 5. Topics estimated for the noisy swimmer dataset by a) proposed RP, b) proposed DDP, c) Gibbs in
(Griffiths & Steyvers, 2004), d) NMF in (Tan & Fevotte, in press), and e) on the clean dataset by RecL2 in (Arora et al.,
2012a), closest to the 16 ideal (ground truth) topics. Gibbs misses 5 and NMF misses 6 of the ground truth topics, while
RP and DDP recover all 16 and our topic estimates look less noisy. RecL2 hits 4 on the clean dataset.
Figure 4. Topic errors for (a) Gibbs (Griffiths & Steyvers,
2004) and (b) NMF (Tan & Fevotte, in press), and (c) example
topics extracted by RecL2 (Arora et al., 2012a) on the
noisy Swimmer dataset. (d) Example topic errors for RecL2
on the clean Swimmer dataset. The figure depicts extracted topics
that are not close to any "ground truth". The ground truth
topics correspond to 16 different positions of left/right
arms and legs.



In this section we apply our algorithm to the
synthetic swimmer image dataset introduced in
(Donoho & Stodden, 2004). There are $M = 256$ binary
images, each with $W = 32 \times 32 = 1024$ pixels.
Each image represents a swimmer composed of four
limbs, each of which can be in one of 4 distinct positions,
and a torso. We interpret pixel positions as
words. Each image is interpreted as a document composed
of pixel positions with non-zero values. Since
each position of a limb features some unique pixels in
the image, the topic matrix $\beta$ satisfies the separability
assumption with $K = 16$ "ground truth" topics that
correspond to 16 single limb positions.

Following the setting of (Tan & Fevotte, in press), we
set body pixel values to 10 and background pixel values
to 1. We then take each "clean" image, suitably
normalized, as an underlying distribution across pixels
and generate a "noisy" document of $N = 200$
iid "words" according to the topic model. Examples
are shown in Fig. 3. We then apply the RP and
DDP algorithms to the "noisy" dataset and compare
against Gibbs (Griffiths & Steyvers, 2004), NMF
(Tan & Fevotte, in press), and RecL2 (Arora et al.,
2012a). Results are shown in Figs. 4 and 5. We set the
parameters as discussed in the implementation details.
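The construction of one "noisy" document from a "clean" image can be sketched as below. This is an assumption-laden illustration: the name `noisy_document` is hypothetical, the image is assumed to be given as a flat list of non-negative pixel values (10 for body, 1 for background in the setting above), and "suitably normalized" is read as dividing by the pixel sum.

```python
import random

def noisy_document(image, n_words, rng):
    """Sample n_words pixel indices iid from the normalized pixel intensities."""
    total = float(sum(image))
    weights = [v / total for v in image]  # image as a distribution over pixels
    return rng.choices(range(len(image)), weights=weights, k=n_words)
```

With `n_words = 200` and a 1024-pixel image, most pixels receive zero or one sample, which is the sampling "noise" visible in Fig. 3(b).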

This dataset is a good validation test for different algorithms
since the ground truth topics are known and
unique. As we see in Fig. 4, both Gibbs and NMF
produce topics that do not correspond to any pure
left/right arm/leg positions. Indeed, many of them are
composed of multiple limbs. In contrast, as shown in
Fig. 5, no such errors occur in RP and DDP,
and our topic estimates are closer to the ground truth
images. Meanwhile, the RecL2 algorithm failed to
work even with the clean data. Although it also extracts
extreme points of a convex body, the algorithm
additionally requires these points to be linearly independent.
It is possible that extreme points of a convex
body are linearly dependent (for example, a 2-D
square on a 3-D simplex). This is exactly the case
in the swimmer dataset. As we see in the last row
of Fig. 5, RecL2 produces only a few topics close to the
ground truth. Its extracted topics for the noisy images,
shown in Fig. 4, are not close to the ground truth.

6.4. Real World Text Corpora 

Table 1. Examples of extracted topics for the NIPS
dataset by the proposed random projection method
(RP), data-dependent projection (DDP), the algorithm
in (Griffiths & Steyvers, 2004) (Gibbs), and the practical
algorithm in (Arora et al., 2012a) (RecL2).

RP      chip circuit noise analog current voltage gates
DDP     chip circuit analog voltage pulse vlsi device
Gibbs   analog circuit chip output figure current vlsi
RecL2   N/A

RP      visual cells spatial ocular cortical cortex dominance orientation
DDP     visual cells model cortex orientation cortical eye
Gibbs   cells cortex visual activity orientation cortical receptive
RecL2   orientation knowledge model cells visual good mit

RP      learning training error vector parameters svm data
DDP     learning error training weight network function neural
Gibbs   training error set generalization examples test learning
RecL2   training error set data function test weighted

RP      speech training recognition performance hmm mlp input
DDP     training speech recognition network word classifiers hmm
Gibbs   speech recognition word training hmm speaker mlp acoustic
RecL2   speech recognition network neural positions training learned

Table 2. Examples of estimated topics on NY Times using
the RP and RecL2 algorithms.

RP      weather wind air storm rain cold
RecL2   N/A

RP      feeling sense love character heart emotion
RecL2   N/A

RP      election zzz_florida ballot vote zzz_al_gore recount
RecL2   ballot election court votes vote zzz_al_gore

RP      yard game team season play zzz_nfl
RecL2   yard game play season team touchdown

RP      N/A
RecL2   zzz_kobe_bryant zzz_super_bowl police shot family election


In this section, we apply our algorithm to two different
real-world text corpora from (Frank & Asuncion,
2010). The smaller corpus is the NIPS proceedings dataset
with $M = 1,700$ documents, a vocabulary of $W = 14,036$
words and an average of $N \approx 900$ words in
each document. The other is a large corpus, the New York
(NY) Times articles dataset, with $M = 300,000$,
$W = 102,660$, and $N \approx 300$. The vocabulary is obtained
by deleting a standard "stop" word list used in
computational linguistics, including numbers, individual
characters, and some common English words such
as "the".

In order to compare with the practical algorithm in
(Arora et al., 2012a), we followed the same pruning
as in their experimental setting to shrink the vocabulary
size to $W = 2,500$ for NIPS and $W = 15,000$ for NY
Times. Following typical settings in (Blei, 2012) and
(Arora et al., 2012a), we set $K = 40$ for NIPS and
$K = 100$ for NY Times. We set our parameters as
discussed in the implementation details.

We compare the DDP and RP algorithms against RecL2
(Arora et al., 2012a) and a widely successful practical
algorithm (Griffiths & Steyvers, 2004) (Gibbs). Tables
1 and 2¹ depict typical topics extracted by the
different methods. For each topic, we show its most
frequent words, listed in descending order of the estimated
probabilities. Two topics extracted by different
algorithms are grouped if they are close in $\ell_1$ distance.

The different algorithms extract some fraction of similar
topics which are easy to recognize. Table 1 indicates that
most of the topics extracted by RP and DDP are similar
and are comparable with those of Gibbs. We observe
that the recognizable themes formed by DDP or RP
topics are more abundant than those formed by RecL2. For
example, the topic on "chip design" shown in the first
panel of Table 1 is not extracted by RecL2, and the topics
in Table 2 on "weather" and "emotions" are missing in
RecL2. Meanwhile, the RecL2 method produces some obscure
topics. For example, in the last panel of Table 1,
the RecL2 topic contains more than one theme, and in the last
panel of Table 2 RecL2 produces an unfathomable
combination of words. More details about the extracted
topics are given in the Supplementary section.

7. Conclusion and Discussion 

We summarize our proposed approaches (DDP, Binning
and RP) and compare them with other existing
methods in terms of assumptions, computational complexity
and sample complexity (see Table 3). Among
the listed algorithms, DDP and RecL2 are the most
competitive methods. While the DDP algorithm
has a polynomial sample complexity, its running time
¹ The zzz prefix annotates a named entity.
Table 3. Comparison of approaches. Recover from (Arora et al., 2012b); RecL2 from (Arora et al., 2012a); ECA from
(Anandkumar et al., 2012); Gibbs from (Griffiths & Steyvers, 2004); NMF from (Lee & Seung, 1999). $\mathrm{Time}_W(L.P)$ and
$\mathrm{Time}_K(L.R)$ stand for the computation time of Linear Programming or Linear Regression with $W$ and $K$
variables, respectively. The definitions of the parameters can be found in the reference papers.

Method: DDP
  Computational complexity: $O(N^2 M + W^2) + W\,\mathrm{Time}_K(L.R)$
  Sample complexity ($M$): see Theorem 1 (Section 5.1)
  Assumptions: Separable $\beta$; P1 and P2 on the prior distribution of $\theta$ (Sec. 5); knowledge of $\gamma$ and $d$ (defined in Algorithm 1)
  Remarks: Pr(Error) $\to 0$ exponentially

Method: RP
  Computational complexity: $O(MNK + WK) + W\,\mathrm{Time}_K(L.R)$
  Sample complexity ($M$): N/A
  Assumptions: Separable $\beta$

Method: Binning
  Computational complexity: $O(MN + \sqrt{M}\,W) + W\,\mathrm{Time}_K(L.R)$
  Sample complexity ($M$): N/A
  Assumptions: Separable $\beta$

Method: Recover (Arora et al., 2012b)
  Computational complexity: $O(MN^2) + W\,\mathrm{Time}_W(L.P) + W\,\mathrm{Time}_K(L.R)$
  Assumptions: Separable $\beta$; Robust Simplicial Property of $R$
  Remarks: Pr(Error) $\to 0$; too many Linear Programs make the algorithm impractical

Method: RecL2 (Arora et al., 2012a)
  Computational complexity: $O(W^2 + WK/\epsilon^2 + MN^2) + W\,\mathrm{Time}_K(L.R)$
  Sample complexity ($M$): a maximum of two terms, each scaling with $\log(W)$; see (Arora et al., 2012a)
  Assumptions: Separable $\beta$; Robust Simplicial Property of $R$
  Remarks: Pr(Error) $\to 0$; requires novel words to be linearly independent

Method: ECA (Anandkumar et al., 2012)
  Computational complexity: $O(W^3 + MN^2)$
  Sample complexity ($M$): N/A. For the provided basic algorithm, the probability of error is at most 1/4 but does not converge to zero
  Assumptions: LDA model; the concentration parameter $\alpha_0$ of the Dirichlet distribution is known
  Remarks: Requires solving an SVD for a large matrix, which makes it impractical; Pr(Error) $\not\to 0$ for the basic algorithm

Method: Gibbs (Griffiths & Steyvers, 2004)
  Computational complexity: N/A
  Sample complexity ($M$): N/A
  Assumptions: LDA model
  Remarks: No convergence guarantee

Method: NMF (Tan & Fevotte, in press)
  Computational complexity: N/A
  Sample complexity ($M$): N/A
  Assumptions: General model
  Remarks: Non-convex optimization; no convergence guarantee

is better than that of RecL2, which depends on $1/\epsilon^2$.
Although $\epsilon$ seems to be independent of $W$, increasing
$W$ decreases the elements of $\beta$, and the
precision ($\epsilon$) needed to recover $\beta$
decreases accordingly. This results in a larger time complexity for
RecL2. In contrast, the time complexity of DDP does not
scale with $\epsilon$. On the other hand, the sample complexities
of both DDP and RecL2, while scaling polynomially,
depend on too many different terms, which makes
them difficult to compare.
However, terms corresponding to similar concepts appear
in the two bounds. For example, it can be seen
that $p\,a_\wedge \approx \eta$, because the novel words are possibly the
rarest words. Moreover, $\lambda_\wedge$ and $\gamma$, which are the
$\ell_2$ and $\ell_1$ condition numbers of $R$, are closely related.
Finally, $a = a_\vee / a_\wedge$, with $a_\vee$ and $a_\wedge$ being the maximum
and minimum values in $a$.
Supplementary Materials



A. Proofs 



Given that $\beta$ is separable, we can reorder the rows of $\beta$ such
that $\beta = \begin{bmatrix} D \\ \beta' \end{bmatrix}$, where $D$ is diagonal. We will assume
the same structure for $\beta$ throughout the section.

A.1. Proof of Proposition 3

Proposition 3 is a direct result of Theorem 1. Please
refer to Section A.7 for more details.

A.2. Proof of Proposition 4

Recall that Proposition 4 summarizes the computational
complexity of the DDP Algorithm 1. Here we
provide more details.

Proposition 4 (in Section 4.1). The running time
of the data-dependent projection algorithm (DDP, Algorithm 1) is
$O(MN^2 + W^2)$.

Proof: We can show that, because of the sparsity of
$X$, $C = M X X'^{\top}$ can be computed in $O(MN^2 + W)$
time. First, note that $C$ is a scaled word-word co-occurrence
matrix, which can be calculated by adding
up the co-occurrence matrices of each document. This
running time can be achieved if all $W$ words in the
vocabulary are first indexed by a hash table (which
takes $O(W)$). Then, since each document consists of
at most $N$ words, $O(N^2)$ time is needed to compute
the co-occurrence matrix of each document. Finally,
the summation of these matrices to obtain $C$
costs $O(MN^2)$, which results in a total time complexity
of $O(MN^2 + W)$. Moreover, for each word $i$, we have
to find $J_i$ and test whether $C_{i,i} - C_{i,j} \ge \gamma/2$ for all
$j \in J_i$. Clearly, the cost of doing this is $O(W^2)$ in the
worst case. ■
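The per-document accumulation argument can be made concrete with a small sketch. It is not the authors' implementation: the name `cooccurrence_counts` is hypothetical, each document is assumed to be supplied as two lists of word indices (one contributing to $X$, the other to $X'$), and only the raw pair counts are accumulated, so the scaling and normalization needed to obtain $C$ itself are left out.

```python
from collections import defaultdict

def cooccurrence_counts(docs):
    """Accumulate sparse word-word pair counts, one document at a time.

    docs: iterable of (first_part, second_part) lists of word indices.
    Only pairs that actually occur are stored, so the cost is O(N^2) per
    document and O(M N^2) overall, independent of the vocabulary size.
    """
    counts = defaultdict(float)
    for first_part, second_part in docs:
        for i in first_part:        # at most N words
            for j in second_part:   # at most N words
                counts[(i, j)] += 1.0
    return counts
```

Storing the counts in a hash map keyed by word pairs is what keeps the cost tied to the document lengths rather than to $W^2$.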

A.3. Proof of Proposition 5

Recall that Proposition 5 summarizes the computational
complexity of RP (Algorithm 2) and Binning
(see Section B in the appendix for more details). Here
we provide a more detailed proof.

Proposition 5 (in Section 4.1) The running times of
RP (Algorithm 2) and the Binning algorithm (in Appendix
Section B) are $O(MNK + WK)$ and $O(MN + \sqrt{M}\,W)$,
respectively.

Proof: Note that the number of operations needed to
find the projections is $O(MN + W)$ in Binning and
$O(MNK + W)$ in RP. This can be achieved by first
indexing the words by a hash table and then finding
the projection of each document along the corresponding
component of the random directions. Clearly, that
takes $O(N)$ time for each document. In addition, finding
the word with the maximum projection value (in
RP) and the winner in each bin (in Binning) takes
$O(W)$. This amounts to $O(WK)$ for all projections
in RP and $O(\sqrt{M}\,W)$ for all of the bins in Binning.
Adding the running times of these two parts, the computational
complexity of the RP and Binning algorithms
is $O(MNK + WK)$ and $O(MN + \sqrt{M}\,W)$, respectively. ■
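The two steps counted in the proof (one pass over the documents to accumulate projections, then one pass over the vocabulary to find the maximizer) can be sketched for a single direction as follows. The function name `extreme_word` is hypothetical, the direction is passed in explicitly so the example is deterministic, and each word's cross-document pattern is assumed to be normalized by the word's total count; this is an illustration of the idea rather than Algorithm 2 itself.

```python
def extreme_word(docs, vocab_size, direction):
    """Return the word whose normalized cross-document frequency pattern
    has the largest projection along `direction` (one weight per document)."""
    proj = [0.0] * vocab_size
    total = [0] * vocab_size
    for m, doc in enumerate(docs):   # O(N) work per document
        for w in doc:
            proj[w] += direction[m]
            total[w] += 1
    seen = [w for w in range(vocab_size) if total[w] > 0]  # skip unused words
    return max(seen, key=lambda w: proj[w] / total[w])     # O(W) scan
```

In the toy corpus `[[0, 0, 1], [1, 2, 2]]`, words 0 and 2 each occur in a single document and word 1 occurs in both; the direction `[1.0, -1.0]` selects word 0 and the opposite direction selects word 2, while word 1 is never extreme.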

A.4. Proof of Proposition 6

Proposition 6 (in Section 4.2) is a direct result of Theorem 2.
Please read Section A.8 for the detailed proof.

A.5. Validation of Assumptions in Section 5
for Dirichlet Distribution

In this section, we prove the validity of the assumptions
P1 and P2 which were made in Section 5.



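For $x \in \mathbb{R}^K$ with $\sum_{i=1}^{K} x_i = 1$, $x_i \ge 0$; $x \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$
has pdf $P(x) = c \prod_{i=1}^{K} x_i^{\alpha_i - 1}$. Let
$\alpha_\wedge = \min_{1 \le i \le K} \alpha_i$ and $\alpha_0 = \sum_{i=1}^{K} \alpha_i$.

Proposition A.1 For a Dirichlet prior
$\mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$:

1. The correlation matrix $R$ is positive definite with
minimum eigenvalue $\lambda_\wedge \ge \frac{\alpha_\wedge}{\alpha_0(\alpha_0 + 1)}$.

2. $\forall\, 1 \le i \ne j \le K$, $\frac{R_{i,i}}{a_i a_i} - \frac{R_{i,j}}{a_i a_j} > 0$.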

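Proof. The covariance matrix of $\mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, denoted
as $\Sigma$, can be written as

$$\Sigma_{i,j} = \begin{cases} \dfrac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)} & \text{if } i \ne j \\[2mm] \dfrac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)} & \text{otherwise} \end{cases} \qquad (3)$$

Compactly we have $\Sigma = \frac{1}{\alpha_0^2(\alpha_0 + 1)} \left( -\alpha\alpha^{\top} + \alpha_0\, \mathrm{diag}(\alpha) \right)$
with $\alpha = (\alpha_1, \ldots, \alpha_K)$. The mean vector is $\mu = \frac{1}{\alpha_0}\alpha$.
Hence we obtain

$$R = \frac{1}{\alpha_0^2(\alpha_0 + 1)} \left( -\alpha\alpha^{\top} + \alpha_0\, \mathrm{diag}(\alpha) \right) + \frac{1}{\alpha_0^2}\alpha\alpha^{\top} = \frac{1}{\alpha_0(\alpha_0 + 1)} \left( \alpha\alpha^{\top} + \mathrm{diag}(\alpha) \right)$$

Note that $\alpha_i > 0$ for all $i$, so $\alpha\alpha^{\top}$ is positive semidefinite and
$\mathrm{diag}(\alpha)$ is positive definite. Hence $R$ is strictly positive definite, and
its minimum eigenvalue satisfies $\lambda_\wedge \ge \frac{\alpha_\wedge}{\alpha_0(\alpha_0 + 1)}$.

The second property follows by directly plugging in equation (3). □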
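As a numerical sanity check of the second property, one can build $R = \frac{1}{\alpha_0(\alpha_0+1)}(\alpha\alpha^{\top} + \mathrm{diag}(\alpha))$ and $a = \alpha/\alpha_0$ and evaluate the smallest gap directly. The sketch below uses hypothetical helper names (`dirichlet_moments`, `min_gap`). Substituting the two formulas gives $\frac{R_{i,i}}{a_i a_i} - \frac{R_{i,j}}{a_i a_j} = \frac{\alpha_0}{(\alpha_0 + 1)\alpha_i}$, so the minimum over $i$ is attained at the largest $\alpha_i$.

```python
def dirichlet_moments(alpha):
    """Return (R, a): second-moment matrix and mean vector of Dir(alpha)."""
    a0 = sum(alpha)
    K = len(alpha)
    c = 1.0 / (a0 * (a0 + 1.0))
    R = [[c * (alpha[i] * alpha[j] + (alpha[i] if i == j else 0.0))
          for j in range(K)] for i in range(K)]
    a = [x / a0 for x in alpha]
    return R, a

def min_gap(alpha):
    """Smallest value of R_ii/(a_i a_i) - R_ij/(a_i a_j) over all i != j."""
    R, a = dirichlet_moments(alpha)
    K = len(alpha)
    return min(R[i][i] / (a[i] * a[i]) - R[i][j] / (a[i] * a[j])
               for i in range(K) for j in range(K) if i != j)
```

For the symmetric prior with $\alpha_i = 0.1$ and $K = 5$ used in the synthetic experiments, the gap is $0.5 / (1.5 \times 0.1) \approx 3.33$.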
A.6. Convergence Property of the
Co-occurrence Matrix

In this section, we prove a set of lemmas as ingredients
to prove the main Theorems 1, 2 and 3 in Section 5.
These lemmas in sequence show:

• Convergence of $C = M X X'^{\top}$ (Lemma 1);

• Convergence of $C_{i,i} - 2C_{i,j} + C_{j,j}$ to a strictly
positive value if $i, j$ are not novel words of the
same topic (Lemma 2);

• Convergence of $J_i$ to $J_i^*$ such that if $i$ is novel,
$C_{i,i} - C_{i,j}$ converges to a strictly positive value
for $j \in J_i^*$, and if $i$ is non-novel, $\exists j \in J_i^*$ such
that $C_{i,i} - C_{i,j}$ converges to a non-positive value
(Lemmas 3 and 4).



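Recall that in Algorithm 1, $C = M X X'^{\top}$. Let us further
define $E_{i,j} = \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)}$ and $\eta = \min_{1 \le i \le W} \beta_i a$. Let $R$
and $a$ be the correlation matrix and mean vector of the
prior distribution of $\theta$.

Before we dig into the proofs, we provide two limit
analysis results related to Slutsky's theorem:

Proposition 7. For random variables $X_n$ and $Y_n$ and
real numbers $x, y > 0$, if $\Pr(|X_n - x| \ge \epsilon) \le g_n(\epsilon)$ and
$\Pr(|Y_n - y| \ge \epsilon) \le h_n(\epsilon)$, then

$$\Pr\left( \left| \frac{X_n}{Y_n} - \frac{x}{y} \right| \ge \epsilon \right) \le g_n\left( \frac{y\epsilon}{4} \right) + h_n\left( \frac{\epsilon y^2}{4x} \right) + h_n\left( \frac{y}{2} \right)$$

And if $0 \le x, y \le 1$,

$$\Pr(|X_n Y_n - xy| \ge \epsilon) \le g_n\left( \frac{\epsilon}{2} \right) + h_n\left( \frac{\epsilon}{2} \right) + g_n\left( \frac{\epsilon}{2y} \right) + h_n\left( \frac{\epsilon}{2x} \right)$$

Lemma 1. Let $C_{i,j} = M X_i X_j'^{\top}$. Then $C_{i,j} \xrightarrow{p} E_{i,j} = \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)}$. Specifically,

$$\Pr(|C_{i,j} - E_{i,j}| \ge \epsilon) \le 8 \exp(-M \epsilon^2 \eta^8 / 32)$$

Proof. By the definition of $C_{i,j}$, we have:

$$C_{i,j} = \frac{\frac{1}{M} \tilde{X}_i \tilde{X}_j'^{\top}}{\left( \frac{1}{M} \tilde{X}_i \mathbf{1} \right) \left( \frac{1}{M} \tilde{X}_j' \mathbf{1} \right)} \xrightarrow{p} \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)} \qquad (4)$$

as $M \to \infty$, where $\mathbf{1} = (1, 1, \ldots, 1)^{\top}$ and the convergence
follows because of the convergence of the numerator
and denominator and then applying Slutsky's theorem.
The convergence of the numerator and denominator
are results of the strong law of large numbers due to the
fact that entries in $\tilde{X}_i$ and $\tilde{X}_j'$ are independent.

To be precise, we have:

$$\frac{E\left( \frac{1}{M} \tilde{X}_i \tilde{X}_j'^{\top} \right)}{E\left( \frac{1}{M} \tilde{X}_i \mathbf{1} \right) E\left( \frac{1}{M} \tilde{X}_j' \mathbf{1} \right)} = \frac{E_\theta E_{X|\theta}\left( \frac{1}{M} \tilde{X}_i \tilde{X}_j'^{\top} \right)}{E_\theta E_{X|\theta}\left( \frac{1}{M} \tilde{X}_i \mathbf{1} \right) E_\theta E_{X|\theta}\left( \frac{1}{M} \tilde{X}_j' \mathbf{1} \right)} = \frac{E_\theta\left( \frac{1}{M} A_i A_j^{\top} \right)}{E_\theta\left( \frac{1}{M} A_i \mathbf{1} \right) E_\theta\left( \frac{1}{M} A_j \mathbf{1} \right)} = \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)}$$

To show the convergence rate explicitly, we use Proposition 7.
For simplicity, define $C_{i,j} = \frac{F_{i,j}}{G_i H_j}$. Note that
entries in $\tilde{X}_i$ and $\tilde{X}_j'$ are independent and bounded, so
by Hoeffding's inequality, we obtain:

$$\Pr(|F_{i,j} - E(F_{i,j})| \ge \epsilon) \le 2 \exp(-2M\epsilon^2)$$
$$\Pr(|G_i - E(G_i)| \ge \epsilon) \le 2 \exp(-2M\epsilon^2)$$
$$\Pr(|H_j - E(H_j)| \ge \epsilon) \le 2 \exp(-2M\epsilon^2)$$

Hence,

$$\Pr(|G_i H_j - E(G_i)E(H_j)| \ge \epsilon) \le 8 \exp(-M\epsilon^2/2)$$

and

$$\Pr\left( \left| \frac{F_{i,j}}{G_i H_j} - \frac{E(F_{i,j})}{E(G_i)E(H_j)} \right| \ge \epsilon \right) \le 2 \exp(-M\epsilon^2 (\beta_j a\, \beta_i a)^2 / 8) + 8 \exp(-M\epsilon^2 (\beta_j a\, \beta_i a)^4 / 32) + 8 \exp(-M (\beta_j a\, \beta_i a)^2 / 8)$$

Let $\eta = \min_{1 \le i \le W} \beta_i a \le 1$. We obtain

$$\Pr\left( \left| \frac{F_{i,j}}{G_i H_j} - \frac{E(F_{i,j})}{E(G_i)E(H_j)} \right| \ge \epsilon \right) \le 18 \exp(-M\epsilon^2 \eta^8 / 32) \qquad (5)$$

□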

Corollary 1. $C_{i,i} - 2C_{i,j} + C_{j,j}$ converges as $M \to \infty$.
The convergence rate is $c_1 \exp(-M c_2 \epsilon^2 \eta^8)$ for $\epsilon$ error,
with $c_1$ and $c_2$ being constants in terms of $M$.

Corollary 2. $C_{i,i} - C_{i,j}$ converges as $M \to \infty$. The
convergence rate is $d_1 \exp(-M d_2 \epsilon^2 \eta^8)$ for $\epsilon$ error,
with $d_1$ and $d_2$ being constants in terms of $M$.
Recall that we define $\mathcal{C}_k$, $k = 1, \ldots, K$, to be the novel
words of topic $k$, and $\mathcal{C}_0$ to be the set of non-novel
words. $\mathrm{supp}(\beta_i)$ denotes the column indices of the non-zero
entries of a row vector $\beta_i$ of the $\beta$ matrix.

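Lemma 2. If $i, j \in \mathcal{C}_k$ ($i$, $j$ are novel words of the
same topic), then $C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} 0$. Otherwise,
$\forall k$, if $i \in \mathcal{C}_k$, $j \notin \mathcal{C}_k$, then $C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} f(i,j) \ge d > 0$,
where $d = \lambda_\wedge \beta_\wedge^2$. In particular, if $i \in \mathcal{C}_0$ and
$j \notin \mathcal{C}_0$, then $C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} f(i,j) \ge d > 0$.

Proof. It was shown in Lemma 1 that $C_{i,j} \xrightarrow{p} \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)}$,
where $R$ is the correlation matrix and $a = (a_1, \ldots, a_K)^{\top}$
is the mean of the prior. Hence

$$C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} \left( \frac{\beta_i}{\beta_i a} - \frac{\beta_j}{\beta_j a} \right) R \left( \frac{\beta_i}{\beta_i a} - \frac{\beta_j}{\beta_j a} \right)^{\top} \ge \lambda_\wedge \left\| \frac{\beta_i}{\beta_i a} - \frac{\beta_j}{\beta_j a} \right\|^2$$

Note that we have assumed $R$ to be positive definite with
its minimum eigenvalue lower bounded by a positive
value, $\lambda_\wedge > 0$.

If $i, j \in \mathcal{C}_k$ for some $k$, then $\frac{\beta_i}{\beta_i a} - \frac{\beta_j}{\beta_j a} = 0$ and hence
$C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} 0$.

Otherwise, if $\mathrm{supp}(\beta_i) \ne \mathrm{supp}(\beta_j)$, then
$\left\| \frac{\beta_i}{\beta_i a} - \frac{\beta_j}{\beta_j a} \right\|^2 \ge \beta_\wedge^2$ (note that $\beta_i a \le 1$), which
proves the first part of the lemma.

For the second part, note that if $i \in \mathcal{C}_0$ and $j \notin \mathcal{C}_0$, the
supports of $\beta_i$ and $\beta_j$ are necessarily different. Hence,
the previous analysis directly leads to the conclusion.

□

Recall that in Algorithm 1, $J_i = \{ j : j \ne i,\ C_{i,i} - 2C_{i,j} + C_{j,j} \ge d/2 \}$. We have:

Lemma 3. $J_i$ converges in probability in the following
senses:

1. For a novel word $i \in \mathcal{C}_k$, define $J_i^* = \mathcal{C}_k^c$. Then
for all novel words $i$, $\lim_{M \to \infty} \Pr(J_i \subseteq J_i^*) = 1$.

2. For a non-novel word $i \in \mathcal{C}_0$, define
$J_i^* = \mathcal{C}_0^c$. Then for all non-novel words $i$,
$\lim_{M \to \infty} \Pr(J_i \supseteq J_i^*) = 1$.

Proof. Let $d \triangleq \lambda_\wedge \beta_\wedge^2$. According to Lemma 2,
whenever $\mathrm{supp}(\beta_i) \ne \mathrm{supp}(\beta_j)$, $D_{i,j} \triangleq C_{i,i} - 2C_{i,j} + C_{j,j} \xrightarrow{p} f(i,j) \ge d$
for the novel word $i$. In other words,
for a novel word $i \in \mathcal{C}_k$ and $j \notin \mathcal{C}_k$, $D_{i,j}$ will be
concentrated around a value greater than or equal to
$d$. Hence, the probability that $D_{i,j}$ is less than $d/2$
will vanish. In addition, by the union bound we have

$$\Pr(J_i \not\subseteq J_i^*) \le \Pr(J_i \ne J_i^*) = \Pr(\exists j \in J_i^* : j \notin J_i) \le \sum_{j \notin \mathcal{C}_k} \Pr(D_{i,j} \le d/2)$$

Since $\sum_{j \notin \mathcal{C}_k} \Pr(D_{i,j} \le d/2)$ is a finite sum of vanishing
terms given $i \in \mathcal{C}_k$, $\Pr(J_i \not\subseteq J_i^*)$ also vanishes as $M \to \infty$
and hence we prove the first part.

For the second part, note that for a non-novel word
$i \in \mathcal{C}_0$, $D_{i,j}$ converges to a value no less than $d$ provided
that $j \notin \mathcal{C}_0$ (according to Lemma 2). Hence

$$\Pr(J_i \not\supseteq J_i^*) \le \Pr(J_i \ne J_i^*) = \Pr(\exists j \in J_i^* : j \notin J_i) \le \sum_{j \notin \mathcal{C}_0} \Pr(D_{i,j} \le d/2)$$

Similarly, $\sum_{j \notin \mathcal{C}_0} \Pr(D_{i,j} \le d/2)$ vanishes for a non-novel
word $i \in \mathcal{C}_0$ as $M \to \infty$, so $\Pr(J_i \not\supseteq J_i^*)$ will also
vanish, which concludes the second part. □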



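As a result of Lemmas 1, 2 and 3, the convergence rate
of the events in Lemma 3 is:

Corollary 3. For a novel word $i \in \mathcal{C}_k$ we have
$\Pr(J_i \not\subseteq J_i^*) \le W c_1 \exp(-M c_3 d^2 \eta^8)$. And for a non-novel
word $i \in \mathcal{C}_0$, $\Pr(J_i \not\supseteq J_i^*) \le K c_1 \exp(-M c_4 d^2 \eta^8)$,
where $c_1$, $c_3$, and $c_4$ are constants and $d = \lambda_\wedge \beta_\wedge^2$.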

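Lemma 4. If $\forall i \ne j$, $\frac{R_{i,i}}{a_i a_i} - \frac{R_{i,j}}{a_i a_j} \ge \zeta$, we have the
following results on the convergence of $C_{i,i} - C_{i,j}$:

1. If $i$ is a novel word, $\forall j \in J_i \subseteq J_i^*$: $C_{i,i} - C_{i,j} \xrightarrow{p} g(i,j) \ge \gamma > 0$,
where $J_i^*$ is defined in Lemma 3,
$\gamma = \zeta a_\wedge \beta_\wedge$ and $a_\wedge$ is the minimum component of
$a$.

2. If $i$ is a non-novel word, $\exists j \in J_i^*$ such that $C_{i,i} - C_{i,j} \xrightarrow{p} g(i,j) \le 0$.

Proof. Let us reorder the words so that $i \in \mathcal{C}_i$. Using
equation (4), $C_{i,i} \xrightarrow{p} \frac{R_{i,i}}{a_i a_i}$ and $C_{i,j} \xrightarrow{p} \sum_{k=1}^{K} b_k \frac{R_{i,k}}{a_i a_k}$
with $b_k = \frac{\beta_{j,k} a_k}{\sum_{l=1}^{K} \beta_{j,l} a_l}$. Note that the $b_k$'s are non-negative
and sum up to one.

By the assumption, $\frac{R_{i,i}}{a_i a_i} - \frac{R_{i,j}}{a_i a_j} \ge \zeta$ for $j \ne i$. Note
that $\forall j \in J_i \subseteq J_i^*$, there exists some index $k \ne i$ such
that $b_k \ne 0$. Then

$$C_{i,i} - C_{i,j} \xrightarrow{p} \sum_{k=1}^{K} b_k \left( \frac{R_{i,i}}{a_i a_i} - \frac{R_{i,k}}{a_i a_k} \right) = \sum_{\substack{k=1 \\ k \ne i}}^{K} b_k \left( \frac{R_{i,i}}{a_i a_i} - \frac{R_{i,k}}{a_i a_k} \right) \ge \zeta \sum_{k \ne i} b_k$$

Since $\beta_j a \le 1$, we have $\sum_{k \ne i} b_k \ge \beta_\wedge a_\wedge$, and
the first part of the lemma is concluded.

To prove the second part, note that for $i \in \mathcal{C}_0$ and
$j \notin \mathcal{C}_0$ (reordering the words so that $j \in \mathcal{C}_j$),

$$C_{i,j} \xrightarrow{p} \sum_{k=1}^{K} b_k \frac{R_{j,k}}{a_j a_k}$$

with $b_k = \frac{\beta_{i,k} a_k}{\sum_{l=1}^{K} \beta_{i,l} a_l}$. Now define:

$$j_i^* \triangleq \arg\max_{j} \sum_{k=1}^{K} b_k \frac{R_{j,k}}{a_j a_k} \qquad (6)$$

We obtain

$$\sum_{l=1}^{K} \sum_{k=1}^{K} b_l b_k \frac{R_{l,k}}{a_l a_k} \le \sum_{k=1}^{K} b_k \frac{R_{j_i^*,k}}{a_{j_i^*} a_k}$$

As a result, $C_{i,i} - C_{i,j_i^*} \xrightarrow{p} \sum_{l=1}^{K} \sum_{k=1}^{K} b_l b_k \frac{R_{l,k}}{a_l a_k} - \sum_{k=1}^{K} b_k \frac{R_{j_i^*,k}}{a_{j_i^*} a_k} \le 0$
and the proof is complete.

□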



A.7. Proof of Theorem 1

Now we can prove Theorem 1 in Section 5. To summarize
the notation, let $\beta_\wedge$ be a strictly positive lower
bound on the non-zero elements of $\beta$, $\lambda_\wedge$ be the minimum
eigenvalue of $R$, and $a_\wedge$ be the minimum component
of the mean vector $a$. Further we define $\eta = \min_{1 \le i \le W} \beta_i a$
and $\zeta \triangleq \min_{1 \le i \ne j \le K} \left( \frac{R_{i,i}}{a_i a_i} - \frac{R_{i,j}}{a_i a_j} \right) > 0$.

Theorem 1 (in Section 5.1)

For parameter choices $d = \lambda_\wedge \beta_\wedge^2$ and $\gamma = \zeta a_\wedge \beta_\wedge$ the
DDP algorithm is consistent as $M \to \infty$. Specifically,
true novel and non-novel words are asymptotically declared
as novel and non-novel, respectively. Furthermore, for

$$M \ge \frac{C_1 \left( \log W + \log \frac{1}{\delta_1} \right)}{\beta_\wedge^2 \eta^8 \min(\lambda_\wedge^2 \beta_\wedge^2, \zeta^2 a_\wedge^2)}$$

where $C_1$ is a constant, Algorithm 1 finds all novel
words without any outlier with probability at least
$1 - \delta_1$, where $\eta = \min_{1 \le i \le W} \beta_i a$.

Proof of Theorem 1. Suppose that $i$ is a novel word.
The probability that $i$ is not detected by the DDP
Algorithm can be written as

$$\Pr(J_i \not\subseteq J_i^* \text{ or } (J_i \subseteq J_i^* \text{ and } \exists j \in J_i : C_{i,i} - C_{i,j} \le \gamma/2))$$
$$\le \Pr(J_i \not\subseteq J_i^*) + \Pr((J_i \subseteq J_i^* \text{ and } \exists j \in J_i : C_{i,i} - C_{i,j} \le \gamma/2))$$
$$\le \Pr(J_i \not\subseteq J_i^*) + \Pr(\exists j \in J_i^* : C_{i,i} - C_{i,j} \le \gamma/2)$$
$$\le \Pr(J_i \not\subseteq J_i^*) + \sum_{j \in J_i^*} \Pr(C_{i,i} - C_{i,j} \le \gamma/2)$$

The first and second terms on the right hand side converge
to zero according to Lemmas 3 and 4, respectively.
Hence, the probability of failing to detect $i$
as a novel word converges to zero.

On the other hand, the probability of claiming a non-novel
word as a novel word by the DDP Algorithm can
be written as:

$$\Pr(J_i \not\supseteq J_i^* \text{ or } (J_i \supseteq J_i^* \text{ and } \forall j \in J_i : C_{i,i} - C_{i,j} \ge \gamma/2))$$
$$\le \Pr(J_i \not\supseteq J_i^*) + \Pr((J_i \supseteq J_i^* \text{ and } \forall j \in J_i : C_{i,i} - C_{i,j} \ge \gamma/2))$$
$$\le \Pr(J_i \not\supseteq J_i^*) + \Pr(\forall j \in J_i^* : C_{i,i} - C_{i,j} \ge \gamma/2)$$
$$\le \Pr(J_i \not\supseteq J_i^*) + \Pr(C_{i,i} - C_{i,j_i^*} \ge \gamma/2)$$

where $j_i^*$ was defined in equation (6). We have shown
in Lemmas 3 and 4 that both of the probabilities on the
right hand side converge to zero. This concludes the
consistency of the algorithm.

Combining the convergence rates given in Corollaries
1, 2 and 3, the probability that the DDP Algorithm
fails to find all novel words without any outlier is
bounded by $W^2 e_1 \exp(-M e_2 \min(d^2, \gamma^2) \eta^8)$, where
$e_1$ and $e_2$ are constants and $d$ and $\gamma$ are defined in the
Theorem. □

A.8. Proof of Theorem 2

Theorem 2 (in Section 5.2) For $d = \lambda_\wedge \beta_\wedge^2$, given
all true novel words as the input, the clustering algorithm,
Algorithm 4 (Cluster Novel Words), asymptotically
(as $M \to \infty$) recovers $K$ novel word indices of
different types, namely, the supports of the corresponding
$\beta$ rows are different for any two retrieved indices.
Furthermore, if

$$M \ge \frac{C_2 \left( \log W + \log \frac{1}{\delta_2} \right)}{\eta^8 \lambda_\wedge^2 \beta_\wedge^4}$$

then Algorithm 4 clusters all novel words correctly
with probability at least $1 - \delta_2$.

Proof of Theorem 2. The statement follows using $\binom{W}{2}$
union bounds on the probability that $C_{i,i} - 2C_{i,j} + C_{j,j}$
is outside an interval of length $d/2$
centered around the value it converges to. The convergence
rates of the related random variables are given
in Lemma 1. Hence the probability that the clustering
algorithm fails to cluster all the novel words correctly
is bounded by $e_1 W^2 \exp(-M e_2 \eta^8 d^2)$, where $e_1$ and $e_2$
are constants and $d$ is defined in the theorem. □

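A.9. Proof of Theorem 3

Theorem 3 (in Section 5.3) Suppose that Algorithm
5 outputs $\hat{\beta}$ given the indices of $K$ distinct novel words.
Then, $\hat{\beta} \xrightarrow{p} \beta$. Specifically, if

$$M \ge \frac{C_3 W^4 \left( \log(W) + \log(K) + \log(1/\delta_3) \right)}{\lambda_\wedge^2 \eta^8 \epsilon^4 a_\wedge^8}$$

then for all $i$ and $j$, $\hat{\beta}_{i,j}$ will be $\epsilon$ close to $\beta_{i,j}$ with
probability at least $1 - \delta_3$, with $\epsilon < 1$, $C_3$ being a
constant, $a_\wedge = \min_i a_i$ and $\eta = \min_{1 \le i \le W} \beta_i a$.

Proof. We reorder the rows so that $Y$ and $Y'$ are the
first $K$ rows of $X$ and $X'$, respectively. For the optimization
objective function in Algorithm 5, if $i \le K$,
$b = e_i$ achieves the minimum, where all components
of $e_i$ are zero, except its $i$-th component, which is
one. Now fix $i$; we denote the objective function as
$Q_M(b) = M (X_i - bY)(X_i' - bY')^{\top}$, and denote the
optimal solution as $b_M^*$. By the previous lemmas,
$Q_M(b) \xrightarrow{p} Q(b) = b D R D b^{\top} - 2 b D R \frac{\beta_i^{\top}}{\beta_i a} + \frac{\beta_i R \beta_i^{\top}}{(\beta_i a)^2}$,
where $D = \mathrm{diag}(a)^{-1}$. Note that if $R$ is positive definite,
$Q$ is uniquely minimized at $b^* = \frac{\beta_i}{\beta_i a} D^{-1}$.

Following the notation in Lemma 1 and its proof,

$$\Pr(|C_{i,j} - E_{i,j}| \ge \epsilon) \le 8 \exp(-M \epsilon^2 \eta^8 / 32)$$

where $C_{i,j} = M X_i X_j'^{\top}$, $E_{i,j} = \frac{\beta_i R \beta_j^{\top}}{(\beta_i a)(\beta_j a)}$, and $\eta = \min_{1 \le i \le W} \beta_i a$.
Note that $b \in \mathcal{B} = \{ b : 0 \le b_k \le 1, \sum_k b_k = 1 \}$.
Therefore, $\forall s, r \in \{1, \ldots, K, i\}$ :
$|C_{s,r} - E_{s,r}| \le \epsilon$ implies that

$$\forall b \in \mathcal{B} : |Q_M(b) - Q(b)| \le |C_{i,i} - E_{i,i}| + \sum_{k=1}^{K} b_k |C_{k,i} - E_{k,i}| + \sum_{k=1}^{K} b_k |C_{i,k} - E_{i,k}| + \sum_{r=1}^{K} \sum_{s=1}^{K} b_r b_s |C_{r,s} - E_{r,s}| \le 4\epsilon$$

Hence

$$\Pr(\exists b \in \mathcal{B} : |Q_M(b) - Q(b)| \ge 4\epsilon) \le \Pr(\exists i, j \in \{1, \ldots, K, i\} : |C_{i,j} - E_{i,j}| \ge \epsilon) \qquad (7)$$

Using $(K + 1)^2$ union bounds for the right hand side of
equation 7, we obtain the following equation with
$c_1$ and $c_2$ being two constants:

$$\Pr(\exists b \in \mathcal{B} : |Q_M(b) - Q(b)| \ge \epsilon) \le c_1 (K + 1)^2 \exp(-c_2 M \epsilon^2 \eta^8) \qquad (8)$$

Now we show that $b_M^*$ converges to $b^*$. Note that $b^*$
is the unique minimizer of the strictly convex function
$Q(b)$. The strict convexity of $Q$ follows from the fact
that $R$ is assumed to be positive definite. Therefore,
we have, $\forall \epsilon_0 > 0$, $\exists \delta > 0$ such that $\|b - b^*\| \ge \epsilon_0 \Rightarrow Q(b) - Q(b^*) \ge \delta$. Hence,

$$\Pr(\|b_M^* - b^*\| \ge \epsilon_0)$$
$$\le \Pr(Q(b_M^*) - Q(b^*) \ge \delta)$$
$$\le \Pr(Q(b_M^*) - Q_M(b_M^*) + Q_M(b_M^*) - Q_M(b^*) + Q_M(b^*) - Q(b^*) \ge \delta)$$
$$\overset{(i)}{\le} \Pr(Q(b_M^*) - Q_M(b_M^*) + Q_M(b^*) - Q(b^*) \ge \delta)$$
$$\overset{(ii)}{\le} \Pr(2 \sup_{b \in \mathcal{B}} |Q_M(b) - Q(b)| \ge \delta)$$
$$\le \Pr(\exists b \in \mathcal{B} : |Q_M(b) - Q(b)| \ge \delta/2)$$
$$\overset{(iii)}{\le} c_1 (K + 1)^2 \exp\left( -\frac{c_2}{4} \delta^2 \eta^8 M \right)$$

where (i) follows because $Q_M(b_M^*) - Q_M(b^*) \le 0$ by
definition, (ii) holds considering the fact that $b_M^*, b^* \in \mathcal{B}$,
and (iii) follows as a result of equation 8.

For the $\epsilon_0$ and $\delta$ relationship, let $y = b - b^*$,

$$Q(b) - Q(b^*) = y (DRD) y^{\top} \ge \|y\|^2 \lambda^*$$

where $\lambda^* > 0$ is the minimum eigenvalue of $DRD$.
Note that $\lambda^* \ge \left( \min_{1 \le j \le K} a_j^{-1} \right)^2 \lambda_\wedge$, where $\lambda_\wedge > 0$ is a
lower bound on the minimum eigenvalue of $R$. But
$0 < a_j \le 1$, hence $\lambda^* \ge \lambda_\wedge$. Hence we could set
$\delta = \lambda_\wedge \epsilon_0^2$. In sum, we could obtain

$$\Pr(\|b_M^* - b^*\| \ge \epsilon_0) \le c_1 (K + 1)^2 \exp(-c_2' M \epsilon_0^4 \lambda_\wedge^2 \eta^8)$$

for the constants $c_1$ and $c_2'$. Or simply $b_M^* \xrightarrow{p} b^*$.
Note that before column normalization, we let $\hat{\beta}_i = \left( \frac{1}{M} \tilde{X}_i \mathbf{1} \right) (b_M^*)$.
Using the convergence of the first term (to
$\beta_i a$), which we have already verified in Lemma 1, and
Slutsky's theorem, we get $\hat{\beta}_i \xrightarrow{p} \beta_i D^{-1}$. Hence after
column normalization, which involves the convergence
of $W$ random variables, by Slutsky's theorem again we
can prove that $\hat{\beta}_i \xrightarrow{p} \beta_i$ for any $1 \le i \le W$. This concludes
our proof and directly implies the convergence
in the Mean-Square sense.

To show the exact convergence rate, we apply
Proposition 7. For $\hat{\beta}_i$ before column normalization,
note that $\frac{1}{M} \tilde{X}_i \mathbf{1}$ converges to $\beta_i a$ with error probability
$2 \exp(-2\epsilon^2 M)$; we obtain

$$\Pr(|\hat{\beta}_{i,j} - \beta_{i,j} a_j| \ge \epsilon) \le e_1 (K + 1)^2 \exp(-e_2 \lambda_\wedge^2 \eta^8 M \epsilon^4) + e_3 \exp(-2 e_4 \epsilon^2 M)$$

for constants $e_1, \ldots, e_4$. On the other hand, the column
normalization factors can be obtained by $\mathbf{1}^{\top} \hat{\beta}$.
Denote the normalization factor of the $j$-th column by $P_j = \sum_{i=1}^{W} \hat{\beta}_{i,j}$,
and hence $\Pr(|P_j - a_j| \ge \epsilon) \le e_1 W (K + 1)^2 \exp(-e_2 \lambda_\wedge^2 \eta^8 M \epsilon^4 / W^4) + e_3 W \exp(-e_4 \epsilon^2 M / W^2)$.
Now using Proposition 7 again we obtain that after
column normalization,

$$\Pr\left( \left| \frac{\hat{\beta}_{i,j}}{P_j} - \beta_{i,j} \right| \ge \epsilon \right) \le f_1 (K + 1)^2 \exp(-f_2 \lambda_\wedge^2 \eta^8 M \epsilon^4 a_\wedge^4) + f_3 \exp(-2 f_4 \epsilon^2 M a_\wedge^2) + f_5 W (K + 1)^2 \exp(-f_6 \lambda_\wedge^2 \eta^8 M \epsilon^4 a_\wedge^8 / W^4)$$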

for constants $f_1, \ldots, f_6$ and $a_\wedge$ being the minimum
value of the $a_i$'s. Assuming $\epsilon < 1$, we can simplify the
previous expression to obtain

$$\Pr\left( \left| \frac{\hat{\beta}_{i,j}}{P_j} - \beta_{i,j} \right| \ge \epsilon \right) \le b_1 W (K + 1)^2 \exp(-b_2 \lambda_\wedge^2 \eta^8 M \epsilon^4 a_\wedge^8 / W^4)$$

for constants $b_1$ and $b_2$. Finally, to get the error probability
of the whole matrix, we can use $WK$ union
bounds. Hence we have

$$\Pr\left( \exists i, j : \left| \frac{\hat{\beta}_{i,j}}{P_j} - \beta_{i,j} \right| \ge \epsilon \right) \le b_1 W^2 K (K + 1)^2 \exp(-b_2 \lambda_\wedge^2 \eta^8 M \epsilon^4 a_\wedge^8 / W^4)$$

Therefore, the sample complexity of $\epsilon$-close estimation
of $\beta_{i,j}$ by Algorithm 5 with probability at least
$1 - \delta_3$ will be given by:

$$M \ge \frac{C_3 W^4 \left( \log(W) + \log(K) + \log(1/\delta_3) \right)}{\lambda_\wedge^2 \eta^8 \epsilon^4 a_\wedge^8}$$

□

B. Experiment results

B.1. Sample Topics extracted on NIPS dataset

Tables 4, 5, 6, and 7 show the most frequent words
in topics extracted by various algorithms on the NIPS
dataset. The words are listed in descending order.
There are $M = 1,700$ documents. The average number of words per
document is $N \approx 900$. The vocabulary size is $W = 2,500$.

It is difficult and confusing to group four sets of topics.
We simply show the topics extracted by each algorithm
individually.

B.2. Sample Topics extracted on New York
Times dataset

Tables 8 to 11 show the most frequent words in topics
extracted by the algorithms on the NY Times dataset. There
are $M = 300,000$ documents. The average number of words per
document is $N \approx 300$. The vocabulary size is $W = 15,000$.
Table 4. Examples of extracted topics on NIPS by Gibbs 

Algorithm  Most frequent words 
Gibbs      analog circuit chip output figure current vlsi 
Gibbs      cells cortex visual activity orientation cortical receptive 
Gibbs      training error set generalization examples test learning 
Gibbs      speech recognition word training hmm speaker mlp acoustic 
Gibbs      function theorem bound threshold number proof dimension 
Gibbs      model modeling observed neural parameter proposed similar 
Gibbs      node tree graph path number decision structure 
Gibbs      features set figure based extraction resolution line 
Gibbs      prediction regression linear training nonlinear input experts 
Gibbs      performance problem number results search time table 
Gibbs      motion direction eye visual position velocity head 
Gibbs      function basis approximation rbf kernel linear radial gaussian 
Gibbs      network neural output recurrent net architecture feedforward 
Gibbs      local energy problem points global region optimization 
Gibbs      units inputs hidden layer network weights training 
Gibbs      representation connectionist activation distributed processing language sequence 
Gibbs      time frequency phase temporal delay sound amplitude 
Gibbs      learning rule based task examples weight knowledge 
Gibbs      state time sequence transition markov finite dynamic 
Gibbs      algorithm function convergence learning loss step gradient 
Gibbs      image object recognition visual face pixel vision 
Gibbs      neurons synaptic firing spike potential rate activity 
Gibbs      memory patterns capacity associative number stored storage 
Gibbs      classification classifier training set decision data pattern 
Gibbs      level matching match block instance hierarchical part 
Gibbs      control motor trajectory feedback system controller robot 
Gibbs      information code entropy vector bits probability encoding 
Gibbs      system parallel elements processing computer approach implementation 
Gibbs      target task performance human response subjects attention 
Gibbs      signal filter noise source independent channel filters processing 
Gibbs      recognition task architecture network character module neural 
Gibbs      data set method clustering selection number methods 
Gibbs      space distance vectors map dimensional points transformation 
Gibbs      likelihood gaussian parameters mixture bayesian data prior 
Gibbs      weight error gradient learning propagation term back 
Gibbs      order structure natural scale properties similarity analysis 
Gibbs      distribution probability variance sample random estimate 
Gibbs      dynamics equations point fixed case limit function 
Gibbs      matrix linear vector eq solution problem nonlinear 
Gibbs      learning action reinforcement policy state optimal actions control function goal environment 
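
Each row of a table such as Table 4 lists the highest-probability words of one estimated topic in descending order. The following is a minimal sketch of how such rows could be produced from an estimated topic matrix; it is an illustration only and not the code used for the experiments. The function name `top_words`, the default of seven words per topic, and the toy numbers are assumptions made for this example.

```python
def top_words(beta, vocab, n_top=7):
    """Return the n_top highest-probability words of each topic.

    beta  : list of W rows, each a list of K probabilities;
            beta[w][k] is the probability of word w under topic k.
    vocab : list of W words; vocab[w] is the word with index w.
    n_top : number of words to list per topic (assumed default).
    """
    if len(beta) != len(vocab):
        raise ValueError("beta must have one row per vocabulary word")
    n_topics = len(beta[0])
    topics = []
    for k in range(n_topics):
        # Sort word indices by decreasing probability under topic k.
        order = sorted(range(len(vocab)), key=lambda w: -beta[w][k])
        topics.append([vocab[w] for w in order[:n_top]])
    return topics


# Toy example with made-up numbers (not taken from the experiments).
vocab = ["chip", "circuit", "analog", "cells", "cortex", "visual"]
beta = [
    [0.40, 0.00],
    [0.35, 0.05],
    [0.25, 0.00],
    [0.00, 0.50],
    [0.00, 0.30],
    [0.00, 0.15],
]
for k, words in enumerate(top_words(beta, vocab, n_top=3)):
    print(k, " ".join(words))
```

With the toy matrix above, topic 0 is listed as "chip circuit analog" and topic 1 as "cells cortex visual".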
Table 5. Examples of extracted topics on NIPS by DDP (Data Dependent Projections) 

Algorithm  Most frequent words 
DDP        loss function minima smoothing plasticity logistic site 
DDP        spike neurons firing time neuron amplitude modulation 
DDP        clustering data teacher learning level hidden model error 
DDP        distance principal image loop flow tangent matrix vectors 
DDP        network experts user set model importance data 
DDP        separation independent sources signals predictor mixing component 
DDP        concept learning examples tracking hypothesis incremental greedy 
DDP        learning error training weight network function neural 
DDP        visual cells model cortex orientation cortical response 
DDP        population tuning sparse codes implicit encoding cybern 
DDP        attention selective mass coarse gradients switching occurred 
DDP        temperature annealing graph matching assignment relaxation correspondence 
DDP        role representation connectionist working symbolic distributed expressions 
DDP        auditory frequency sound time signal spectral spectrum filter 
DDP        language state string recurrent noise giles order 
DDP        family symbol coded parameterized labelled discovery 
DDP        memory input capacity patterns number associative layer 
DDP        model data models distribution algorithm probability gaussian 
DDP        risk return optimal history learning costs benchmark 
DDP        kernel data weighting estimators divergence case linear 
DDP        channel information noise membrane input mutual signal 
DDP        image surface filters function scene neural regions 
DDP        delays window receiving time delay adjusting network 
DDP        training speech recognition network word neural hmm 
DDP        information code entropy vector bits probability encoding 
DDP        figure learning model set training segment labeled 
DDP        tree set neighbor trees number decision split 
DDP        control motor model trajectory controller learning arm 
DDP        chip circuit analog voltage current pulse vlsi 
DDP        recognition object rotation digit image letters translation 
DDP        processor parallel list dependencies serial target displays 
DDP        network ensemble training networks monte-carlo input neural 
DDP        block building terminal experiment construction basic oriented 
DDP        input vector lateral competitive algorithm vectors topology 
DDP        direction velocity cells head system model place behavior 
DDP        recursive structured formal regime analytic realization rigorous 
DDP        similarity subjects structural dot psychological structure product 
DDP        character words recognition system characters text neural 
DDP        learning state time action reinforcement policy robot path 
DDP        function bounds threshold set algorithm networks dept polynomial 
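
Since the topics of different algorithms are shown individually rather than grouped, a reader who wants a rough correspondence between two tables can compare the word lists directly. The sketch below uses the Jaccard overlap of the listed words as a similarity score. This is not a procedure from the paper: the score, the function names `jaccard` and `best_matches`, and the choice of three rows from Table 4 and three from Table 5 are assumptions made for illustration.

```python
def jaccard(words_a, words_b):
    """Jaccard overlap between two word lists, treated as sets."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b.

    Returns a list of (index_in_topics_b, score) pairs.
    """
    matches = []
    for words_a in topics_a:
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        best = max(range(len(scores)), key=lambda j: scores[j])
        matches.append((best, scores[best]))
    return matches


# Three rows copied from Table 4 (Gibbs) and three from Table 5 (DDP).
gibbs = [
    "information code entropy vector bits probability encoding".split(),
    "analog circuit chip output figure current vlsi".split(),
    "control motor trajectory feedback system controller robot".split(),
]
ddp = [
    "chip circuit analog voltage current pulse vlsi".split(),
    "control motor model trajectory controller learning arm".split(),
    "information code entropy vector bits probability encoding".split(),
]

for i, (j, score) in enumerate(best_matches(gibbs, ddp)):
    print(f"Gibbs row {i} -> DDP row {j} (overlap {score:.2f})")
```

On these rows the three Gibbs topics are matched to DDP rows 2, 0, and 1 with overlaps 1.00, 0.56, and 0.40. A score of this kind ignores word order and word probabilities, so it gives only a coarse indication of correspondence.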
Table 6. Examples of extracted topics on NIPS by RP (Random Projections) 

Algorithm  Most frequent words 
RP         data learning set pitch space exemplars note music 
RP         images object face image recognition model objects network 
RP         synaptic neurons network input spike time cortical timing 
RP         hand video wavelet recognition system sensor gesture time 
RP         neural function networks functions set data network number 
RP         template network input contributions neural component output transient 
RP         learning state model function system cart failure time 
RP         cell membrane cells potential light response ganglion retina 
RP         tree model data models algorithm leaves learning node 
RP         state network learning grammar game networks training finite 
RP         visual cells spatial ocular cortical model dominance orientation 
RP         input neuron conductance conductances current firing synaptic rate 
RP         set error algorithm learning training margin functions function 
RP         items item signature handwriting verification proximity signatures recognition 
RP         separation ica time eeg blind independent data components 
RP         control model network system feedback neural learning controller 
RP         cells cell firing model cue cues layer neurons 
RP         stress human bengio chain region syllable profile song 
RP         genetic fibers learning population implicit model algorithms algorithm 
RP         chip circuit noise analog current voltage time input 
RP         hidden input data states units training set error 
RP         network delay phase time routing load neural networks 
RP         query examples learning data algorithm dependencies queries loss 
RP         sound auditory localization sounds owl optic knudsen barn 
RP         head eye direction cells position velocity model rat 
RP         learning tangent distance time call batch rate data 
RP         binding role representation tree product structure structures completion 
RP         learning training error vector parameters svm teacher data 
RP         problem function algorithm data penalty constraints model graph 
RP         speech training recognition performance hmm mlp input network 
RP         learning schedule time execution instruction scheduling counter schedules 
RP         boltzmann learning variables state variational approximation algorithm function 
RP         state learning policy action states optimal time actions 
RP         decoding frequency output figure set message languages spin 
RP         network input figure image contour texture road task 
RP         receptor structure disparity image function network learning vector 
RP         visual model color image surround response center orientation 
RP         pruning weights weight obs error network obd elimination 
RP         module units damage semantic sharing network clause phrase 
RP         character characters recognition processor system processors neural words 
Table 7. Examples of extracted topics on NIPS by RecL2 

Algorithm  Most frequent words 
RecL2      network networks supported rbf function neural data training 
RecL2      asymptotic distance tangent algorithm vectors set vector learning 
RecL2      learning state negative policy algorithm time function complex 
RecL2      speech recognition speaker network positions training performance networks 
RecL2      cells head operation direction model cell system neural 
RecL2      object model active recognition image views trajectory strings 
RecL2      spike conditions time neurons neuron model type input 
RecL2      network input neural recognition training output layer networks 
RecL2      maximum motion direction visual figure finally order time 
RecL2      learning training error input generalization output studies teacher 
RecL2      fact properties neural output neuron input current system 
RecL2      sensitive chain length model respect cell distribution class 
RecL2      easily face images image recognition set based examples 
RecL2      model time system sound proportional figure dynamical frequency 
RecL2      lower training free classifiers classification error class performance 
RecL2      network networks units input training neural output unit 
RecL2      figure image contour partially images point points local 
RecL2      control network learning neural system model time processes 
RecL2      learning algorithm time rate error density gradient figure 
RecL2      state model distribution probability models variables versus gaussian 
RecL2      input network output estimation figure winner units unit 
RecL2      learning model data training models figure set neural 
RecL2      function algorithm loss internal learning vector functions linear 
RecL2      system model state stable speech models recognition hmm 
RecL2      image algorithm images system color black feature problem 
RecL2      orientation knowledge model cells visual good cell mit 
RecL2      network memory neural networks neurons input time state 
RecL2      neural weight network networks learning neuron gradient weights 
RecL2      data model set algorithm learning neural models input 
RecL2      training error set data function test generalization optimal 
RecL2      model learning power deviation control arm detection circuit 
RecL2      tree expected data node algorithm set varying nodes 
RecL2      data kernel model final function space linear set 
RecL2      target visual set task tion cost feature figure 
RecL2      model posterior map visual figure cells activity neurons 
RecL2      function neural networks functions network threshold number input 
RecL2      neural time pulse estimation scene figure contrast neuron 
RecL2      network networks training neural set error period ensemble 
RecL2      information data distribution mutual yield probability input backpropagation 
RecL2      units hidden unit learning network layer input weights 
Table 8. Extracted topics on NY Times by RP 

Algorithm  Most frequent words 
RP         com daily question beach palm statesman american 
RP         building house center home space floor room 
RP         cup minutes add tablespoon oil food pepper 
RP         article fax information com syndicate contact separate 
RP         history american flag war zzz_america country zzz_american 
RP         room restaurant hotel tour trip night dinner 
RP         meeting official agreement talk deal plan negotiation 
RP         plane pilot flight crash jet accident crew 
RP         fire attack dead victim zzz_world_trade_center died firefighter 
RP         team game zzz_laker season player play zzz_nba 
RP         food dog animal bird drink eat cat 
RP         job office chief manager executive president director 
RP         family father son home wife mother daughter 
RP         point half lead shot left minutes quarter 
RP         game team season coach player play games 
RP         military ship zzz_army mission officer boat games 
RP         need help important problem goal process approach 
RP         scientist human science research researcher zzz_university called 
RP         computer system zzz_microsoft software window program technology 